Abstract

The purpose of this article is to provide an adaptive estimator of the baseline function in the 
Cox model with high-dimensional covariates. We consider a two-step procedure: first, we estimate
the regression parameter of the Cox model via a Lasso procedure based on the partial log-likelihood;
second, we plug this Lasso estimator into a least-squares type criterion and then perform a model
selection procedure to obtain an adaptive penalized contrast estimator of the baseline function.

Using non-asymptotic estimation results stated for the Lasso estimator of the regression parameter,
we establish a non-asymptotic oracle inequality for this penalized contrast estimator of the
baseline function, which highlights how the rate of convergence degrades when the dimension
of the covariates increases.

Keywords: Survival analysis; Conditional hazard rate function; Cox’s proportional hazards model; 
Right-censored data; Semi-parametric model; Nonparametric model; High-dimensional covariates; 
Model selection; Non-asymptotic oracle inequalities; Concentration inequalities 


1 Introduction 

Consider the following Cox model, introduced by Cox (1972) and defined, for a vector of covariates
$Z = (Z_1, \ldots, Z_p)^T$, by

$$\lambda_0(t, Z) = \alpha_0(t) \exp(\beta_0^T Z), \tag{1}$$

where $\lambda_0$ denotes the hazard rate, $\beta_0 = (\beta_{0_1}, \ldots, \beta_{0_p})^T \in \mathbb{R}^p$ is the regression parameter and $\alpha_0$ is the
baseline hazard function. The Cox partial log-likelihood, introduced by Cox (1972), makes it possible to estimate
$\beta_0$ without knowledge of $\alpha_0$, which is treated as a functional nuisance parameter. To estimate
$\alpha_0$, a common approach is a two-step procedure: first estimate $\beta_0$ alone, and
then plug this estimator into a nonparametric estimator of $\alpha_0$, usually a kernel-type estimator.

Let us be more specific. 

When $p$ is small compared to $n$, $\beta_0$ is usually estimated by minimizing the negative
Cox partial log-likelihood. We refer to Andersen et al. (1993), as a reference book, for the proofs of
the consistency and the asymptotic normality of $\hat\beta$ when $p$ is small compared to $n$. These strategies
only apply when $p < n$ and, even more restrictively, only when $p$ is small compared to $n$. When $p$
grows, becoming of the same order as $n$ and possibly larger than $n$, various well-known problems
appear. Among them, the minimization of the negative Cox partial log-likelihood becomes
difficult, and even impossible if $p > n$.

In high dimension, when $p$ is large compared to $n$, the Lasso procedure is one of the classical
strategies. The Lasso (Least Absolute Shrinkage and Selection Operator) was first
introduced by Tibshirani (1996) in the linear regression model. It has been widely studied in the
additive regression model (see for instance Knight and Fu (2000), Efron et al. (2004), Donoho et al.
(2006), Meinshausen and Buhlmann (2006), Zhao and Yu (2006), Zhang and Huang (2008), Meinshausen
and Yu (2009) and also Juditsky and Nemirovski (2000), Nemirovski (2000), Bunea et al.
(2006; 2007a;b), Greenshtein and Ritov (2004) or Bickel et al. (2009)), and in density estimation (see
Bunea et al. (2007c) and Bertin et al. (2011)). In the particular case of the semi-parametric Cox
model, Tibshirani (1997) proposed a Lasso procedure for the regression parameter. The Lasso
estimator of the regression parameter $\beta_0$ is defined as the minimizer of the negative Cox partial
log-likelihood under an $\ell_1$-type constraint, that is, suitably penalized with an $\ell_1$-penalty function.
Recent results exist on the estimation of $\beta_0$ in the high-dimensional setting. Among them, one can mention
Bradic et al. (2012), who proved asymptotic results for the Lasso estimator. More recently, Bradic
and Song (2012), Kong and Nan (2012) and Huang et al. (2013) established the first non-asymptotic
oracle inequalities (estimation and prediction bounds) for the Lasso estimator.

For the baseline hazard function, when $p$ is small compared to $n$, the common estimator is a
kernel estimator, which depends on the estimator $\hat\beta$ obtained by minimizing the negative Cox partial
log-likelihood. This kernel estimator was introduced by Ramlau-Hansen (1983a;b) from the Breslow
estimator of the cumulative baseline function (see Ramlau-Hansen (1983b) and Andersen et al. (1993)
for more details). In this context, Ramlau-Hansen (1983b) and Gregoire (1993) proved asymptotic
results. No non-asymptotic or adaptive results have been established to date for the
kernel estimator of the baseline function. Finally, when $p$ is large compared to $n$, to our knowledge,
the construction of an estimator of the baseline function has not yet been considered.

In this paper, we consider a two-step procedure to estimate $\beta_0$ and $\alpha_0$, the two parameters of
the Cox model, but our contributions focus on the estimation of $\alpha_0$. In the Cox model we
consider, it is noteworthy that the high dimension only concerns the regression parameter, whereas
the baseline function is a function of time. Its estimation therefore does not require a procedure specific to
high dimension, apart from the first step concerning the estimation of $\beta_0$. We propose a procedure for
the construction of an estimator of the baseline hazard function $\alpha_0$, $p$ being either smaller than $n$ or
greater than $n$. It combines a Lasso procedure for $\beta_0$ as a first step and a second step based on a
model selection strategy for the estimation of the baseline function $\alpha_0$. This model selection procedure
has its origins in the works of Akaike (1973) and Mallows (1973), more recently formalized by Birge
and Massart (1997) and Barron et al. (1999) for the estimation of densities and regression functions
(see the book of Massart (2007) as a reference work on model selection). In survival analysis,
model selection has also been documented. Letue (2000) adapted these methods to estimate the
regression function of the non-parametric Cox model, when $p < n$. More recently, Brunel and Comte
(2005), Brunel et al. (2009) and Brunel et al. (2010) obtained adaptive estimation of densities in a
censoring setting. Model selection methods have also been used to estimate the intensity function of a
counting process in the multiplicative Aalen intensity model (see Reynaud-Bouret (2006) and Comte
et al. (2011)). However, to our knowledge, model selection has never been considered
for estimating the baseline hazard function in the Cox model.

Our contributions are at least threefold. First, our procedure is the first that focuses on the estimation of
the baseline function of the semi-parametric Cox model with high-dimensional covariates. Second, this procedure
provides an adaptive estimator of the baseline function that works equally well for small $p$ and large $p$
compared to $n$ (that is, for possibly high-dimensional covariates). Third, for this estimator, we
state non-asymptotic oracle inequalities that hold, once again, with $p$ either smaller or greater
than $n$. More precisely, we prove that the risk of this estimator achieves the best risk among estimators
in a large collection. For each model, the risk of an estimator is bounded by the sum of three terms.
The first term is a bias term involving the approximation properties of the collection of models,
through the distance, evaluated at $\beta_0$, between the true baseline $\alpha_0$ and its orthogonal projection
on the selected model. The second term is a penalty term of the same order as the variance
on one model, that is, of the order of the dimension of the model over $n$, as expected with an $\ell_0$-penalty. These
two terms are the "usual" terms appearing in nonparametric estimation. It is noteworthy that these
two terms do not involve any quantity related to the risk of the Lasso estimator of $\beta_0$. The last term
comes precisely from the properties of the Lasso estimator of $\beta_0$. This last term is of order $\log(np)/n$,
as expected for a Lasso estimator.

When $p$ is small, the third term is of order $\log(n)/n$ and the rate is governed by the first
two terms. In that case, the penalty term being of the same order as the variance over one model,
we conclude that the model selection procedure achieves the "expected rate" of order $n^{-2\gamma/(2\gamma+1)}$
when the baseline function belongs to a Besov space with smoothness parameter $\gamma$. This continues
to hold when $p$ is of the same order as the sample size $n$. When $p$ is larger than $n$, that is, in
the so-called ultra-high dimension (see Verzelen (2012)), the rate for estimating $\alpha_0$ changes and,
more precisely, is degraded: this is the price to pay for high-dimensional covariates. The degradation
depends on the order of $p$ compared to $n$.

The main tools for stating our results are the theory of marked counting processes and martingales
with jumps, the theory of penalized minimum contrast estimators, and concentration inequalities
such as the Talagrand inequality (see Talagrand (1996)) and a Bernstein inequality for unbounded
martingale processes (see van de Geer (1995) and Comte et al. (2011)), combined with chaining
methods (see Talagrand (2005) and Baraud (2010)).

The article is organized as follows. Section 2 introduces the notations and the assumptions. In Section 3,
we describe the estimation procedure. Section 4
provides non-asymptotic oracle inequalities for the estimator of the baseline hazard function $\alpha_0$, in a
high-dimensional setting for $\beta_0$. In Section 5, we compare the performance of the resulting penalized
contrast estimator to that of the usual kernel estimator on simulated data. Section 6 is devoted to
the proofs: we state some technical results, then we establish the two main theorems and lastly we
prove the technical results. Finally, Appendix A discusses the bound on the estimation error of the
Lasso estimator of the regression parameter of the Cox model.

2 Notations and preliminaries

2.1 Framework with counting processes 

Consider the general setting of counting processes, which embeds the classical case of right censoring.
We follow here the now classical setting of Andersen et al. (1993) or Fleming and Harrington (2011).
For $n$ independent individuals, we observe for $i = 1, \ldots, n$ a counting process $N_i$, a random process $Y_i$
with values in $[0,1]$ and a vector of covariates $Z_i = (Z_{i,1}, \ldots, Z_{i,p})^T \in \mathbb{R}^p$. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability
space and $(\mathcal{F}_t)_{t \geq 0}$ be the filtration defined by

$$\mathcal{F}_t = \sigma\{N_i(s), Y_i(s), 0 \leq s \leq t, Z_i, i = 1, \ldots, n\}.$$

From the Doob-Meyer decomposition, we know that each $N_i$ admits a compensator, denoted by $\Lambda_i$,
such that $M_i = N_i - \Lambda_i$ is an $(\mathcal{F}_t)_{t \geq 0}$ local square-integrable martingale (see Andersen et al. (1993) for
details). We assume in the following that $N_i$ satisfies an Aalen multiplicative intensity model.

Assumption 2.1. For each $i = 1, \ldots, n$ and all $t \geq 0$,

$$\Lambda_i(t) = \int_0^t \lambda_0(s, Z_i) Y_i(s) \, ds, \tag{2}$$

where $\lambda_0(t, z) = \alpha_0(t) e^{\beta_0^T z}$, for $z \in \mathbb{R}^p$.

We observe the independent and identically distributed (i.i.d.) data $(Z_i, N_i(t), Y_i(t), i = 1, \ldots, n, 0 \leq t \leq \tau)$,
where $[0, \tau]$ is the time interval between the beginning and the end of the study.

This general setting, introduced by Aalen (1980), embeds several particular examples such as censored
data, marked Poisson processes and Markov processes (see Andersen et al. (1993) for further details).
We give here details for the right-censoring case. We observe, for $i = 1, \ldots, n$, $(X_i, \delta_i, Z_i)$, where
$X_i = \min(T_i, C_i)$, $\delta_i = 1_{\{T_i \leq C_i\}}$, $T_i$ is the time of interest and $C_i$ the censoring time. With these
notations, the $(\mathcal{F}_t)_{t \geq 0}$-adapted processes $Y_i$ and $N_i$ are respectively defined as the at-risk process
$Y_i(t) = 1_{\{X_i \geq t\}}$ and the counting process $N_i(t) = 1_{\{X_i \leq t, \delta_i = 1\}}$, which jumps when the $i$th individual
dies.
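
As an illustration of the right-censoring case, the observed pair $(X_i, \delta_i)$ and the two processes $Y_i$ and $N_i$ can be computed directly from the survival and censoring times. The following is a minimal sketch; the function names are ours and purely illustrative, and the small data set is made up for the example.

```python
import numpy as np

def observe_right_censored(T, C):
    """Return X_i = min(T_i, C_i) and delta_i = 1{T_i <= C_i}."""
    T = np.asarray(T, dtype=float)
    C = np.asarray(C, dtype=float)
    X = np.minimum(T, C)
    delta = (T <= C).astype(int)
    return X, delta

def at_risk(X, t):
    """At-risk process Y_i(t) = 1{X_i >= t}, for all individuals i."""
    return (np.asarray(X) >= t).astype(int)

def counting(X, delta, t):
    """Counting process N_i(t) = 1{X_i <= t, delta_i = 1}, for all i."""
    return ((np.asarray(X) <= t) & (np.asarray(delta) == 1)).astype(int)

# Toy data (illustrative values only)
T = [2.0, 5.0, 1.0]   # times of interest
C = [3.0, 4.0, 0.5]   # censoring times
X, delta = observe_right_censored(T, C)
```

Only the first individual is observed to die here; the other two are censored, so their counting processes never jump.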

2.2 Assumptions 

Before describing the estimation procedure, we introduce a few assumptions on the framework defined
in Subsection 2.1.

Let $Z \in \mathbb{R}^p$ denote the generic vector of covariates with the same distribution as the vectors of
covariates $Z_i$ of each individual $i$, and $Z_j$ its $j$-th component, namely the $j$-th covariate of the
vector $Z$. Similarly, we denote by $Y$ the generic version of the random processes $Y_i$ with values in $[0,1]$.
We define the standard $L^2$ and $L^\infty$-norms, for $\alpha \in (L^2 \cap L^\infty)([0,\tau])$:

$$\|\alpha\|_2^2 = \int_0^\tau \alpha^2(t) \, dt \quad \text{and} \quad \|\alpha\|_{\infty,\tau} = \sup_{t \in [0,\tau]} |\alpha(t)|.$$

For a vector $b \in \mathbb{R}^p$, we also introduce the $\ell_1$-norm $|b|_1 = \sum_{j=1}^p |b_j|$.

Assumption 2.2.

(i) There exists a positive constant $B$ such that

$$|Z_j| \leq B, \quad \forall j \in \{1, \ldots, p\}.$$

In the following, we denote $A = [-B, B]^p$.

(ii) The vector of covariates $Z$ admits a p.d.f. $f_Z$ such that $\sup_A |f_Z| \leq f_1 < +\infty$.

(iii) There exists $f_0 > 0$ such that, for all $(t,z) \in [0,\tau] \times A$,

$$E[Y(t) | Z = z] f_Z(z) \geq f_0.$$

(iv) For all $t \in [0,\tau]$, $\alpha_0(t) \leq \|\alpha_0\|_{\infty,\tau} < +\infty$.

Remark 2.3. Let us say a few words about these assumptions, starting by noting that these four assumptions
are quite classical and reasonable. To be more specific, Assumption 2.2.(i) is very common when establishing
oracle inequalities for Lasso estimators in various frameworks. In particular, in the Cox model, see
e.g. Huang et al. (2013) and Bradic and Song (2012) for the statement of non-asymptotic oracle
inequalities.

In the specific case of right censoring, Assumption 2.2.(iii) is automatically verified. Indeed, for
$T$ the survival time and $C$ the censoring time, we can write

$$E(Y(t) | Z = z) = E(1_{\{T \wedge C \geq t\}} | Z = z) = (1 - F_{T|Z}(t))(1 - G_{C|Z}(t-)),$$

where $F_{T|Z}$ and $G_{C|Z}$ are the cumulative distribution functions of $T|Z$ and $C|Z$ respectively. It is
known (see Andersen et al. (1993)) that the Kaplan-Meier estimator is consistent only on intervals
of the form $[0,\tau]$, where $\tau < \sup\{t \geq 0, (1 - F_{T|Z}(t))(1 - G_{C|Z}(t)) > 0\}$. Hence, when $f_Z$ is bounded
from below on $A$, there exists $f_0 > 0$ such that

$$\forall (t,z) \in [0,\tau] \times A, \quad E[Y(t) | Z = z] f_Z(z) \geq f_0.$$

Assumption 2.2.(iii) is required in order to compare the natural norm of the baseline function
induced by our contrast to the standard $L^2$-norm (see Proposition 6.1).


3 Estimation procedure 

We now describe our two-step estimation procedure, starting by recalling the Lasso estimation of $\beta_0$
and then giving a bound on its prediction risk. Then, we describe the contrast and the model selection
procedure for the estimation of the baseline function.

3.1 Preliminary estimation of $\beta_0$: procedure and results

The Lasso estimator $\hat\beta$ of the regression parameter $\beta_0$, introduced in Tibshirani (1997), is defined by

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \{-l_n^*(\beta) + \Gamma_n |\beta|_1\}, \tag{3}$$

where $\Gamma_n$ is a positive regularization parameter to be suitably chosen, $|\beta|_1 = \sum_{j=1}^p |\beta_j|$ and $l_n^*$ is the
Cox partial log-likelihood defined by

$$l_n^*(\beta) = \frac{1}{n} \sum_{i=1}^n \int_0^\tau \log \frac{e^{\beta^T Z_i}}{S_n(t, \beta)} \, dN_i(t), \quad \text{where } S_n(t, \beta) = \frac{1}{n} \sum_{i=1}^n e^{\beta^T Z_i} Y_i(t), \ \forall t \geq 0. \tag{4}$$

The risk bounds for the estimator of $\alpha_0$ will naturally involve the risk $|\hat\beta - \beta_0|_1$, which has to be
at least bounded. Thus, we rather consider the following procedure

$$\hat\beta = \arg\min_{\beta \in \mathcal{B}(0, R_1)} \{-l_n^*(\beta) + \mathrm{pen}(\beta)\}, \quad \text{with } \mathrm{pen}(\beta) = \Gamma_n |\beta|_1, \tag{5}$$

where $\mathcal{B}(0, R_1)$ is the ball defined by

$$\mathcal{B}(0, R_1) = \{b \in \mathbb{R}^p : |b|_1 \leq R_1\}, \quad \text{with } R_1 > 0.$$


Consider the following assumption: 

Assumption 3.1. We assume that $|\beta_0|_1 \leq R_2 < +\infty$.

We denote $R = \max(R_1, R_2)$, so that

$$|\hat\beta - \beta_0|_1 \leq 2R \quad \text{a.s.} \tag{6}$$

Such a condition has already been considered by van de Geer (2008) or Kong and Nan (2012). Roughly
speaking, it means that we can restrict our attention to a ball, possibly very large, in a neighborhood
of $\beta_0$ for finding a good estimator of $\beta_0$.

As mentioned above, our risk bounds for the estimator of $\alpha_0$ depend on the risk $|\hat\beta - \beta_0|_1$. Such
bounds on this risk already exist. In particular, in their Theorem 3.1, Huang et al. (2013) state a
non-asymptotic inequality for $|\hat\beta - \beta_0|_1$ in the specific case of bounded counting processes. We consider here
more general processes, possibly unbounded. In the following proposition, we provide a generalization
of the results established by Huang et al. (2013) to the case of unbounded counting processes. We
refer to Appendix A for a proof of Proposition 3.2.

Proposition 3.2. Let $k > 0$, $c > 0$ and $s := \mathrm{Card}\{j \in \{1, \ldots, p\} : \beta_{0_j} \neq 0\}$ be the sparsity index of
$\beta_0$. Assume that $\|\alpha_0\|_{\infty,\tau} < \infty$. Then, under Assumptions 3.1 and 2.2.(i), with probability larger than
$1 - cn^{-k}$, we have

[Inequality (7): display lost in extraction; it bounds $|\hat\beta - \beta_0|_1$ in terms of $C(s)$ and $n$] (7)

where $C(s) > 0$ is a constant depending on the sparsity index $s$.

As mentioned previously, this proposition is crucial to establish a non-asymptotic oracle inequality
for the baseline function. In the rest of the paper, we consider that $\hat\beta$ satisfies Inequality (7).


Assumption 3.3. We assume that

$$\lim_{n \to \infty} C(s) \frac{\log(np)}{n} = 0.$$

This assumption is reasonable: when $p$ is smaller than $n$ or of the same order, it
is automatically fulfilled. It is not satisfied when $p$ becomes too large compared to $n$. This
case corresponds to the now well-known ultra-high dimensional framework. In this specific case,
recent lower bounds in additive regression models typically say that the estimation of the parameter is
essentially impossible (see for example Verzelen (2012)).

3.2 Estimation of $\alpha_0$

We now come to the estimation of the baseline function $\alpha_0$ via a model selection procedure. As usual,
such a procedure requires an empirical estimation criterion, a collection of models and a suitable
penalty function, all presented in the following.

3.2.1 Definition of the estimation criterion

We estimate the baseline function $\alpha_0$ using a least-squares criterion. More precisely, based on the data
$(Z_i, N_i(t), Y_i(t), i = 1, \ldots, n, 0 \leq t \leq \tau)$ and for a fixed $\beta$, we consider the empirical least-squares
type criterion given, for a function $\alpha \in (L^2 \cap L^\infty)([0, \tau])$, by

$$C_n(\alpha, \beta) = -\frac{2}{n} \sum_{i=1}^n \int_0^\tau \alpha(t) \, dN_i(t) + \frac{1}{n} \sum_{i=1}^n \int_0^\tau \alpha^2(t) e^{\beta^T Z_i} Y_i(t) \, dt. \tag{8}$$

The use of such a least-squares empirical criterion in survival analysis is not as usual as in the additive
regression model. Nevertheless, a few recent studies have developed such strategies.
Among them, one can cite Reynaud-Bouret (2006) or Comte et al. (2011).

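Let us define a deterministic scalar product and its associated deterministic norm for $\alpha_1$, $\alpha_2$ and
$\alpha$ functions in $(L^2 \cap L^\infty)([0,\tau])$:

$$\langle \alpha_1, \alpha_2 \rangle_{det(\beta)} = \int_0^\tau \alpha_1(t) \alpha_2(t) E[e^{\beta^T Z} Y(t)] \, dt,$$

$$\|\alpha\|_{det(\beta)}^2 = \int_0^\tau \alpha^2(t) E[e^{\beta^T Z} Y(t)] \, dt. \tag{9}$$

Using the Doob-Meyer decomposition $N_i = M_i + \Lambda_i$ and according to the multiplicative Aalen
model (2), we get, writing $\|\cdot\|_{det} := \|\cdot\|_{det(\beta_0)}$:

$$E[C_n(\alpha, \beta_0)] = \|\alpha\|_{det}^2 - 2\langle \alpha, \alpha_0 \rangle_{det} = \|\alpha - \alpha_0\|_{det}^2 - \|\alpha_0\|_{det}^2,$$

which is minimum when $\alpha = \alpha_0$. Hence, minimizing $C_n(\cdot, \beta_0)$ is a relevant strategy to estimate $\alpha_0$.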

3.2.2 Model selection 

We now describe the model selection procedure in our context, introducing first the collection of 
models. 

Collections of models. Let $\mathcal{M}_n$ be a set of indices and $\{S_m, m \in \mathcal{M}_n\}$ be a collection of models:

$$S_m = \Big\{\alpha : \alpha = \sum_{j \in J_m} a_j^m \varphi_j^m, \ a_j^m \in \mathbb{R}\Big\},$$

where $(\varphi_j^m)_{j \in J_m}$ is an orthonormal basis of $(L^2 \cap L^\infty)([0,\tau])$ for the usual $L^2(P)$-norm. We denote by
$D_m$ the dimension of $S_m$, i.e. $|J_m| = D_m$.

Sequence of estimators. Let us consider $\hat\beta$, the Lasso estimator of $\beta_0$ defined by (5). For each
$m \in \mathcal{M}_n$, we define the estimator

$$\hat\alpha_m^{\hat\beta} = \arg\min_{\alpha \in S_m} \{C_n(\alpha, \hat\beta)\}. \tag{10}$$

Model selection. The relevant space is automatically selected by using the following penalized criterion

$$\hat m^{\hat\beta} = \arg\min_{m \in \mathcal{M}_n} \{C_n(\hat\alpha_m^{\hat\beta}, \hat\beta) + \mathrm{pen}(m)\}, \tag{11}$$

where $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}$ will be defined later.

Final estimator. The final estimator of $\alpha_0$ is then $\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta}$.
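
The selection step (11) is a plain minimization of the penalized contrast over the index set. A minimal sketch follows, assuming that the minimal contrast value and the dimension of each model have already been computed, and assuming a penalty proportional to $D_m/n$ as in (16). The function names, the dictionaries and the default value `K0 = 1.0` are illustrative choices of ours, not values prescribed by the paper.

```python
def penalty(dim, n, sup_alpha0, K0=1.0):
    """Penalty of the form K0 * (1 + ||alpha_0||_inf) * D_m / n (K0 is a free constant here)."""
    return K0 * (1.0 + sup_alpha0) * dim / n

def select_model(contrasts, dims, n, sup_alpha0, K0=1.0):
    """Return the index m minimizing contrast(m) + pen(m), with all penalized scores.

    contrasts: dict m -> minimal value of the contrast C_n over the model S_m
    dims:      dict m -> dimension D_m of S_m
    """
    scores = {m: contrasts[m] + penalty(dims[m], n, sup_alpha0, K0) for m in contrasts}
    m_hat = min(scores, key=scores.get)
    return m_hat, scores

# Illustrative numbers: the contrast keeps decreasing with the dimension,
# but the penalty ends up outweighing the gain of the largest model.
contrasts = {0: -1.00, 1: -1.50, 2: -1.52}
dims = {0: 1, 1: 2, 2: 4}
m_hat, scores = select_model(contrasts, dims, n=100, sup_alpha0=1.0)
```

In this toy example the intermediate model is selected: moving from dimension 2 to dimension 4 lowers the contrast by 0.02 but raises the penalty by 0.04.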

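Let us say a few words about the optimization problem. Denote by $G_m^{\beta}$ the random Gram matrix

$$G_m^{\beta} = \Big( \frac{1}{n} \sum_{i=1}^n \int_0^\tau \varphi_j(t) \varphi_k(t) e^{\beta^T Z_i} Y_i(t) \, dt \Big)_{(j,k) \in J_m^2}. \tag{12}$$

By definition, the estimator $\hat\alpha_m^{\hat\beta}$ is the solution of the equation $G_m^{\hat\beta} \hat A_m^{\hat\beta} = \Gamma_m$, where

$$\hat A_m^{\hat\beta} = (\hat a_j^{\hat\beta})_{j \in J_m} \quad \text{and} \quad \Gamma_m = \Big( \frac{1}{n} \sum_{i=1}^n \int_0^\tau \varphi_j(t) \, dN_i(t) \Big)_{j \in J_m}.$$

The Gram matrix $G_m^{\hat\beta}$ may not be invertible in some cases. Hence we consider the set

[displays (13)-(14) lost in extraction: the set requires $\min \mathrm{Sp}(G_m^{\hat\beta})$ to exceed a threshold involving $\hat f_0$, $B$, $|\beta_0|_1$, $|\hat\beta - \beta_0|_1$ and $n$]

where $\mathrm{Sp}(M)$ denotes the spectrum of the matrix $M$ and $\hat f_0$ satisfies the following assumption:

Assumption 3.4. There exist a preliminary estimator $\hat f_0$ of $f_0$ and two positive constants $C_0 > 0$,
$n_0 > 0$ such that

$$P(|\hat f_0 - f_0| > f_0/2) \leq C_0/n^6 \quad \text{for any } n \geq n_0.$$

From Assumption 3.1, on this set the matrix $G_m^{\hat\beta}$ is invertible and $\hat\alpha_m^{\hat\beta}$ is thus uniquely defined
as

$$\hat\alpha_m^{\hat\beta} = \begin{cases} \arg\min_{\alpha \in S_m} \{C_n(\alpha, \hat\beta)\} & \text{on this set,} \\ 0 & \text{on its complement.} \end{cases}$$
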
3.2.3 Assumptions and examples of the models 

The following assumptions on the models $\{S_m : m \in \mathcal{M}_n\}$ are usual in model selection procedures.
They are verified by the spaces spanned by usual bases: trigonometric basis, regular piecewise polynomial
basis, regular compactly supported wavelet basis and histogram basis. We refer to Barron et al.
(1999) and Brunel and Comte (2005) for other examples and further discussions.

Assumption 3.5.

(i) For all $m \in \mathcal{M}_n$, we assume that

[display lost in extraction: an upper bound on the dimension $D_m$ in terms of $n$]

(ii) For all $m \in \mathcal{M}_n$, there exists $\phi > 0$ such that for all $\alpha$ in $S_m$,

[display lost in extraction: an inequality bounding the sup-norm of $\alpha$ by its $L^2$-norm, with constant $\phi$ and dimension $D_m$]

(iii) The models are nested within each other: $D_{m_1} \leq D_{m_2} \Rightarrow S_{m_1} \subset S_{m_2}$. We denote by $\mathcal{S}_n$ the
global nesting space in the collection and by $\mathcal{D}_n$ its dimension.

Remark 3.6. Assumption 3.5.(i) ensures that the sizes $D_m$ of the models are not too large compared
with the number of observations $n$. This assumption seems reasonable if we remember that $D_m$ is the
number of coefficients to be estimated: if this number is too large compared to the sample size,
we cannot expect to obtain a relevant estimator. Assumption 3.5.(ii) implies a useful connection
between the standard $L^2$-norm and the sup-norm. Assumption 3.5.(iii) ensures that, for all $m, m' \in \mathcal{M}_n$,
$S_m + S_{m'} \subset \mathcal{S}_n$. Thanks to this assumption, one does not have to browse through all models for the
model selection, which reduces the algorithmic complexity of the procedure. In addition, we have from
Assumption 3.5.(i) that $\mathcal{D}_n \leq \sqrt{n}/\log n$.

4 Non-asymptotic oracle inequalities 

We are now in a position to state our main theorem: a non-asymptotic oracle inequality for the
estimator $\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta}$ of the baseline function in the Cox model.

Theorem 4.1. Let Assumptions 2.2.(i)-(iv), Assumption 3.1, Assumption 3.3, Assumption 3.4 and
Assumptions 3.5.(i)-(iii) hold. Let $\alpha_m$ be the projection of $\alpha_0$ on $S_m$ with respect to the deterministic
scalar product when $\beta_0$ is known:

$$\alpha_m = \arg\min_{\alpha \in S_m} E[C_n(\alpha, \beta_0)] = \arg\min_{\alpha \in S_m} \|\alpha - \alpha_0\|_{det}^2. \tag{15}$$

Let $\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta}$ be defined by (10) and (11) with

$$\mathrm{pen}(m) := K_0 (1 + \|\alpha_0\|_{\infty,\tau}) \frac{D_m}{n}, \tag{16}$$

where $K_0$ is a numerical constant. Then, for any $n \geq n_0$, with $n_0$ a constant defined in Assumption 3.4,

$$E\big[\|\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta} - \alpha_0\|_{det}^2\big] \leq \kappa_0 \inf_{m \in \mathcal{M}_n} \{\|\alpha_0 - \alpha_m\|_{det}^2 + 2\,\mathrm{pen}(m)\} + \frac{C_1}{n} + C_2 C(s) \frac{\log(np)}{n}, \tag{17}$$

where $\kappa_0$ is a numerical constant, $C_1$ and $C_2$ are constants depending on $\tau$, $\phi$, $\|\alpha_0\|_{\infty,\tau}$, $f_0$, $E[e^{\beta_0^T Z}]$,
$E[e^{2\beta_0^T Z}]$, $E[e^{4\beta_0^T Z}]$, $B$, $|\beta_0|_1$, the sparsity index $s$ of $\beta_0$ and $\kappa_b$, a constant from the Burkholder
inequality (see Theorem 6.9), and $C(s)$ is the constant depending on the sparsity index of $\beta_0$ in Proposition 3.2.


Inequality (17) provides the first non-asymptotic oracle inequality for an estimator of the baseline
function. This inequality guarantees the performance of our estimator $\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta}$. We refer to Subsection
6.2.1 for details about $C_1$ and $C_2$. In Inequality (17), the risk is bounded by the sum of four terms.

The third term, of order $1/n$, is negligible compared to the others. The first two terms are respectively
the bias and the variance terms. The bias term, $\|\alpha_0 - \alpha_m\|_{det}^2$, corresponds to the approximation
error and decreases with the dimension $D_m$ of the model $S_m$. It depends on the regularity of the true
function, which is unknown: the more regular $\alpha_0$ is, the smaller the bias is. The variance term $\mathrm{pen}(m)$
quantifies the estimation error and, contrary to the bias term, increases with $D_m$. It is of order
$D_m/n$, which corresponds to the order of the variance term on one model. These first three terms do
not involve quantities related to the estimation error of the Lasso estimator of $\beta_0$.

The last term comes precisely from the non-asymptotic control of $|\hat\beta - \beta_0|_1$ given by Proposition
3.2. Indeed, we can rewrite Inequality (17) before using the bound (7):

$$E\big[\|\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta} - \alpha_0\|_{det}^2\big] \leq \kappa_0 \inf_{m \in \mathcal{M}_n} \{\|\alpha_0 - \alpha_m\|_{det}^2 + 2\,\mathrm{pen}(m)\} + \frac{C_1}{n} + C_2 E[|\hat\beta - \beta_0|_1].$$

This inequality makes clearer the role of the first step of the procedure in the control of the estimator
of the baseline function. The bound obtained for this control is of order $\log(np)/n$, which explains
the order of the fourth term. This term quantifies the influence of the high dimension on the estimation
of the baseline hazard function. For small $p$, we obtain the expected rate of convergence for
purely non-parametric estimation, but when $p$ is larger than $n$, the rate of convergence
is degraded. This is the price to pay for dealing with covariates in high dimension.

Corollary 4.2. Assume that $\alpha_0$ belongs to the Besov space $\mathcal{B}_{2,\infty}^{\gamma}([0, \tau])$, with smoothness $\gamma$. Then,
under the assumptions of Theorem 4.1,

$$E\big[\|\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta} - \alpha_0\|_{det}^2\big] \leq C n^{-\frac{2\gamma}{2\gamma+1}} + C_2 C(s) \frac{\log(np)}{n},$$

where $C$ and $C_2$ are constants depending on $\tau$, $\phi$, $\|\alpha_0\|_{\infty,\tau}$, $f_0$, $E[e^{\beta_0^T Z}]$, $E[e^{2\beta_0^T Z}]$, $B$, $|\beta_0|_1$, the
sparsity index $s$ of $\beta_0$ and $\kappa_b$, a constant from the Burkholder inequality (see Theorem 6.9), and $C(s)$ is
the constant depending on the sparsity index of $\beta_0$ from Proposition 3.2.

From Reynaud-Bouret (2006), we know that, for an intensity function without covariates in a
Besov space with smoothness parameter $\gamma$, the minimax rate is $n^{-2\gamma/(2\gamma+1)}$. We infer that this would
also be the optimal rate in our case when the term $\log(np)/n$ is negligible, namely when $p < n$.
However, when the high dimension $p \gg n$ is reached, the remaining term $\log(np)/n$ is no longer negligible
and there is a loss in the rate of convergence, which comes from the difficulty of estimating $\beta_0$.
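
To make the comparison between the two terms concrete, one can evaluate the orders $n^{-2\gamma/(2\gamma+1)}$ and $\log(np)/n$ for a few values of $n$, $p$ and $\gamma$. The sketch below ignores all constants (in particular $C$, $C_2$ and $C(s)$), so it only illustrates orders of magnitude; the function names are ours.

```python
import math

def nonparametric_rate(n, gamma):
    """Order n^{-2*gamma/(2*gamma+1)} of the nonparametric term (constants ignored)."""
    return n ** (-2.0 * gamma / (2.0 * gamma + 1.0))

def lasso_term(n, p):
    """Order log(n*p)/n of the term coming from the Lasso step (constants ignored)."""
    return math.log(n * p) / n

n, gamma = 1000, 1.0
small_p = lasso_term(n, 10)             # p much smaller than n
huge_p = lasso_term(n, math.exp(50.0))  # p exponentially large: log(p) = 50
rate = nonparametric_rate(n, gamma)
```

With these illustrative values, the Lasso term stays below the nonparametric rate for the small $p$ and exceeds it for the exponentially large $p$, which is the degradation discussed above.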

5 Applications: simulation study 

The aim of this section is to illustrate the behavior of the penalized contrast estimator $\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta}$ of the
baseline function in the case of right censoring and to compare it with the usual kernel estimator with
a bandwidth selected by cross-validation introduced by Ramlau-Hansen (1983b).

5.1 Simulated data 

Let us consider the Cox model (1) in the case of right censoring. We consider a cohort of size $n$ and $p$
covariates. In the simulation study, several choices of $n$ and $p$ have been considered. The sample size
$n$ takes the values $n = 200$ and $n = 500$, and $p$ varies between $p = \sqrt{n}$, being 15 and 22 respectively,
and $p = n$, referred to as the high-dimensional case.

The true regression parameter $\beta_0$ is chosen as a vector of dimension $p$, defined by

$$\beta_0 = (0.1, 0.3, 0.5, 0, \ldots, 0)^T \in \mathbb{R}^p,$$

for various $p \geq 3$. For each $n$ and $p$, the design matrix $Z = (Z_{i,j})_{1 \leq i \leq n, 1 \leq j \leq p}$ is simulated independently
from a uniform distribution on $[-1,1]$. We consider survival times $T_i$, $i = 1, \ldots, n$, that are
distributed according to a Weibull distribution $W(a, \lambda)$, namely the associated baseline function is of
the form $\alpha_0(t) = a \lambda^a t^{a-1}$. We simulate three Weibull distributions $W(0.5,1)$, $W(1,1)$, $W(3,4)$ (see
Figure 1). We consider a rate of censoring of 20% and the censoring times $C_i$, for $i = 1, \ldots, n$, are
simulated independently from the survival times via an exponential distribution $\mathcal{E}(1/(\gamma E[T_1]))$, where
$\gamma = 4.5$ is adjusted to the rate of censorship. The time $\tau$ of the end of the study is taken as the
quantile at 90% of $(T_i \wedge C_i)_{i=1,\ldots,n}$. For $i = 1, \ldots, n$, we compute the observed times $X_i = \min(T_i, \tilde C_i)$,
where $\tilde C_i = C_i \wedge \tau$, and the censoring indicators $\delta_i = 1_{\{T_i \leq \tilde C_i\}}$. The definition of $\tilde C_i$ ensures that there
exists some $i \in \{1, \ldots, n\}$ for which $X_i \geq \tau$, so that all estimators are defined on the interval $[0, \tau]$ and
edge effects are avoided.

Figure 1: Plots of the baseline hazard function for different parameters of a Weibull distribution $W(a, \lambda)$.

Each sample $(Z_i, T_i, C_i, X_i, \delta_i, i = 1, \ldots, n)$ is repeated $N_e = 100$ times.
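
The simulation design above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the study: the survival times are drawn by inverting the cumulative hazard $(\lambda t)^a e^{\beta_0^T Z}$ implied by $\alpha_0(t) = a \lambda^a t^{a-1}$, the unknown $E[T_1]$ is replaced by the empirical mean of the simulated times, and the function name, its arguments and the seed are ours.

```python
import numpy as np

def simulate_cox_weibull(n, p, a, lam, gamma=4.5, seed=0):
    """Simulate right-censored data from a Cox model with Weibull baseline W(a, lam)."""
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[:3] = [0.1, 0.3, 0.5]                      # sparse true parameter (requires p >= 3)
    Z = rng.uniform(-1.0, 1.0, size=(n, p))          # covariates, uniform on [-1, 1]
    # Inversion: (lam * T)^a * exp(beta0^T Z) follows a standard exponential law
    E = rng.exponential(size=n)
    T = (E / np.exp(Z @ beta0)) ** (1.0 / a) / lam
    # Censoring times: exponential with mean gamma * E[T_1]
    # (assumption: E[T_1] is approximated by the empirical mean of T)
    C = rng.exponential(scale=gamma * T.mean(), size=n)
    tau = np.quantile(np.minimum(T, C), 0.9)         # end of study: 90% quantile of T ^ C
    C_tilde = np.minimum(C, tau)
    X = np.minimum(T, C_tilde)                       # observed times
    delta = (T <= C_tilde).astype(int)               # censoring indicators
    return Z, X, delta, tau, beta0

Z, X, delta, tau, beta0 = simulate_cox_weibull(n=200, p=15, a=3.0, lam=4.0)
```

Repeating the call with different seeds gives independent replications of the sample.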

5.2 Estimation procedures 

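We implement the procedure in a histogram basis defined, for $j = 1, \ldots, 2^m$, by

$$\varphi_j^m(t) = \sqrt{\frac{2^m}{\tau}} \, 1_{[(j-1)\tau/2^m, \, j\tau/2^m[}(t).$$

In this case, the dimension of $S_m$ is $D_m = 2^m$ and Assumption 3.5.(ii) is satisfied for $\phi = 1/\tau$. We take
$m = 0, \ldots, \lfloor \log(n/\log(n))/\log(2) \rfloor$, so that Assumption 3.5.(i) is fulfilled. In this basis, the estimator
is written as

$$\hat\alpha_m^{\hat\beta}(t) = \sum_{j \in J_m} \hat a_j^{\hat\beta} \varphi_j^m(t), \quad \forall t \in [0, \tau], \tag{18}$$

where

[display lost in extraction: closed-form expression of the coefficients $\hat a_j^{\hat\beta}$ on each interval $[(j-1)\tau/2^m, j\tau/2^m[$]

The final estimator $\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta}$ is obtained from the implementation of the model selection procedure (10)-(11),
replacing in the penalty term the unknown quantity $\|\alpha_0\|_{\infty,\tau}$ by $\|\hat\alpha_{\max(m)}^{\hat\beta}\|_{\infty,\tau}$, the sup-norm of an estimator of $\alpha_0$
computed on the arbitrarily larger space $S_{\max(m)}$.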
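
In the histogram basis the basis functions have disjoint supports, so the Gram matrix of Section 3.2.2 is diagonal and the linear system can be solved coordinate by coordinate. The following sketch works this out for right-censored data, where $Y_i(t) = 1_{\{X_i \geq t\}}$; it is derived from the contrast (8) rather than copied from the paper's display, and the function name and return values are ours. On each bin the estimator reduces to the number of observed events divided by the weighted time at risk.

```python
import numpy as np

def histogram_estimator(X, delta, Z, beta, tau, m):
    """Minimizer of the least-squares contrast over the histogram model with 2^m bins.

    Returns the bin edges, the value of the estimator on each bin and the
    minimal value of the contrast.
    """
    X = np.asarray(X, dtype=float)
    delta = np.asarray(delta)
    Z = np.asarray(Z, dtype=float)
    n, D = len(X), 2 ** m
    edges = np.linspace(0.0, tau, D + 1)
    w = np.exp(Z @ np.asarray(beta, dtype=float))    # e^{beta^T Z_i}
    # exposure[i, j]: time spent at risk by individual i inside bin j
    exposure = np.clip(np.minimum(X[:, None], edges[None, 1:]) - edges[None, :-1], 0.0, None)
    # events[j]: number of observed events in bin j (half-open bins, tau excluded)
    idx = np.floor(X / tau * D).astype(int)
    keep = (delta == 1) & (idx < D)
    events = np.zeros(D)
    np.add.at(events, idx[keep], 1.0)
    scale = D / tau                                  # value of phi_j^2 on its bin
    gram_diag = scale * (w @ exposure) / n           # diagonal of the Gram matrix
    gamma_vec = np.sqrt(scale) * events / n          # (1/n) sum_i int phi_j dN_i
    coef = np.divide(gamma_vec, gram_diag, out=np.zeros(D), where=gram_diag > 0)
    values = np.sqrt(scale) * coef                   # estimator on each bin
    contrast = -np.sum(coef * gamma_vec)             # contrast at the minimizer
    return edges, values, contrast
```

The returned contrast value is the quantity to which the penalty is added when comparing the models $m = 0, 1, \ldots$ in the selection step.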

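We want to compare the performance of the estimator $\hat\alpha_{\hat m^{\hat\beta}}^{\hat\beta}$ to that of the usual kernel estimator
with a bandwidth selected by cross-validation introduced by Ramlau-Hansen (1983b), which we have
also implemented. More precisely, the usual kernel estimator is defined by

$$\hat\alpha_{kernel}^{\hat\beta}(t) = \frac{1}{h_{CV}} \sum_{i=1}^n K\Big(\frac{t - X_i}{h_{CV}}\Big) \frac{\delta_i}{\sum_{j=1}^n e^{\hat\beta^T Z_j} 1_{\{X_j \geq X_i\}}}, \tag{19}$$

where $K(u) = 0.75(1 - u^2) 1_{\{|u| \leq 1\}}$ is the Epanechnikov kernel and the bandwidth $h_{CV}$ has been
selected by cross-validation:

[display lost in extraction: $h_{CV} = \arg\min_h \{\ldots\}$, a cross-validation criterion involving the terms $\Delta N(X_i)/Y(X_i)$ and $\Delta N(X_j)/Y(X_j)$]

where $Y(t) = \sum_{i=1}^n 1_{\{X_i \geq t\}}$.

Both estimators of the baseline hazard function are defined from the Lasso estimator $\hat\beta$ of the
regression parameter defined by (3).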
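
For illustration, a kernel estimator of this type can be obtained by smoothing the increments of the Breslow estimator of the cumulative baseline hazard with the Epanechnikov kernel. The sketch below assumes right-censored data and a bandwidth `h` supplied by the user; the cross-validation step is not implemented, and the function names are ours.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def kernel_baseline(t, X, delta, Z, beta, h):
    """Kernel-smoothed Breslow increments, evaluated at the time points t."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    X = np.asarray(X, dtype=float)
    delta = np.asarray(delta, dtype=float)
    w = np.exp(np.asarray(Z, dtype=float) @ np.asarray(beta, dtype=float))
    # risk[i] = sum_j exp(beta^T Z_j) 1{X_j >= X_i}; positive since j = i is included
    risk = ((X[None, :] >= X[:, None]) * w[None, :]).sum(axis=1)
    jumps = delta / risk                              # Breslow increments at the X_i
    weights = epanechnikov((t[:, None] - X[None, :]) / h)
    return (weights * jumps[None, :]).sum(axis=1) / h
```

A data-driven bandwidth would be obtained by minimizing a cross-validation criterion over a grid of values of `h`; here `h` is simply treated as given.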

The performance of these two estimators is evaluated via a random Mean Integrated Squared
Error (MISErand) adapted to the Cox model and defined by $\mathrm{MISErand}(\alpha, \beta) = E[\mathrm{ISErand}(\alpha, \beta)]$,
where the expectation is taken on $(T_i, C_i, Z_i)$ and

$$\mathrm{ISErand}(\alpha, \beta) = \frac{1}{n} \sum_{i=1}^n \int_0^{X_i} (\alpha(t) - \alpha_0(t))^2 e^{\beta^T Z_i} \, dt. \tag{20}$$

We obtain an estimation of the MISErand by taking the empirical mean over $N_e = 100$ replications.
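
A numerical evaluation of a criterion of this form is straightforward once the estimator and the true baseline can be evaluated on a grid. The following is a minimal sketch assuming that both functions accept NumPy arrays, with a plain trapezoidal rule on each interval $[0, X_i]$; the function name and the grid size `n_grid` are illustrative choices of ours.

```python
import numpy as np

def ise_rand(alpha_hat, alpha0, X, Z, beta, n_grid=2000):
    """(1/n) sum_i int_0^{X_i} (alpha_hat - alpha0)^2(t) exp(beta^T Z_i) dt."""
    X = np.asarray(X, dtype=float)
    w = np.exp(np.asarray(Z, dtype=float) @ np.asarray(beta, dtype=float))
    total = 0.0
    for x_i, w_i in zip(X, w):
        t = np.linspace(0.0, x_i, n_grid)
        sq = (alpha_hat(t) - alpha0(t)) ** 2
        # Trapezoidal rule on [0, X_i]
        total += w_i * np.sum((sq[1:] + sq[:-1]) * np.diff(t)) / 2.0
    return total / len(X)
```

Averaging the returned value over independent replications of the sample gives an empirical counterpart of the MISErand.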


In Table 1, we give the random MISE of the penalized contrast estimator and of the kernel estimator
with a bandwidth selected by cross-validation for different distributions of the survival times.

First, as expected, the random MISEs are smaller for a large $n$ and a small $p$. Then, we observe that
the penalized contrast estimator performs better than the kernel estimator for the Weibull distributions
$W(0.5, 2)$ and $W(3,4)$. Note that the random MISEs are very high for this last distribution. This can
easily be explained by the fact that the baseline hazard function associated with $W(3,4)$ has the
most complicated form, since it increases steeply (see Figure 1). Lastly, for the distribution $W(1.5,1)$,
the random MISEs are smaller for the kernel estimator with a bandwidth selected by
cross-validation than for the penalized contrast estimator.

                        W(1.5,1)             W(0.5,2)             W(3,4)
                    contrast  kernel     contrast  kernel     contrast  kernel
n = 200   p = 15     0.072    0.021       0.626    1.09        5.26     8.48
          p = 200    0.071    0.020       0.613    1.09        5.30     8.33
n = 500   p = 22     0.055    0.009       0.401    1.06        5.24     7.48
          p = 500    0.059    0.008       0.402    1.06        5.25     8.10

Table 1: Random empirical MISE for the penalized contrast estimator in a histogram basis (first column
for each distribution) and for the kernel estimator with a bandwidth selected by cross-validation
(second column for each distribution), with a Lasso estimator of the regression parameter, for three
different Weibull distributions of the survival times.


6 Proofs 

6.1 Technical results 

In this section, we introduce some propositions and lemmas that are necessary to prove the theorems. 
Their proofs are postponed to Subsection 6.3. 

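Let us first introduce the random norm derived from the contrast (8) and associated with the
deterministic norm defined by (9), together with its scalar product: for $\alpha$, $\alpha_1$ and $\alpha_2$ functions in
$(L^2 \cap L^\infty)([0,\tau])$ and $\beta \in \mathbb{R}^p$ fixed,

$$\|\alpha\|^2_{rand(\beta)} = \frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \alpha^2(t)\, e^{\beta^T Z_i}\, Y_i(t)\,dt, \qquad (21)$$

$$\langle \alpha_1, \alpha_2\rangle_{rand(\beta)} = \frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \alpha_1(t)\alpha_2(t)\, e^{\beta^T Z_i}\, Y_i(t)\,dt.$$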


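Subsequently, to lighten the notation, we write $\|.\|_{rand} := \|.\|_{rand(\beta_0)}$, and the same holds for the
associated scalar product. We state a key relation between $\langle .,.\rangle_{rand(\beta)}$ and $C_n(.,\beta)$. By definition, for
all $m \in \mathcal{M}_n$ and $\beta \in \mathbb{R}^p$,

$$C_n(\hat\alpha^{\beta}_{\hat m^{\beta}},\beta) + pen(\hat m^{\beta}) \le C_n(\hat\alpha^{\beta}_{m},\beta) + pen(m) \le C_n(\alpha^{\beta_0}_{m},\beta) + pen(m), \qquad (22)$$

where $\hat m^{\beta} = \arg\min_{m\in\mathcal{M}_n}\{C_n(\hat\alpha^{\beta}_{m},\beta) + pen(m)\}$. Now, we write that

$$C_n(\hat\alpha^{\beta}_{\hat m^{\beta}},\beta) - C_n(\alpha^{\beta_0}_{m},\beta) = -\frac{2}{n}\sum_{i=1}^{n}\int_0^\tau (\hat\alpha^{\beta}_{\hat m^{\beta}} - \alpha^{\beta_0}_{m})(t)\,dN_i(t) + \frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \big(\hat\alpha^{\beta}_{\hat m^{\beta}}(t)^2 - \alpha^{\beta_0}_{m}(t)^2\big)\, e^{\beta^T Z_i}\, Y_i(t)\,dt.$$

Using the Doob-Meyer decomposition, we derive that

$$C_n(\hat\alpha^{\beta}_{\hat m^{\beta}},\beta) - C_n(\alpha^{\beta_0}_{m},\beta) = -2\langle \hat\alpha^{\beta}_{\hat m^{\beta}} - \alpha^{\beta_0}_{m}, \alpha_0\rangle_{rand} + \|\hat\alpha^{\beta}_{\hat m^{\beta}}\|^2_{rand(\beta)} - \|\alpha^{\beta_0}_{m}\|^2_{rand(\beta)} - 2\nu_n(\hat\alpha^{\beta}_{\hat m^{\beta}} - \alpha^{\beta_0}_{m}),$$

where $\nu_n(\alpha)$ is defined by

$$\nu_n(\alpha) = \frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \alpha(t)\,dM_i(t). \qquad (23)$$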
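It follows that

$$C_n(\hat\alpha^{\beta}_{\hat m^{\beta}},\beta) - C_n(\alpha^{\beta_0}_{m},\beta) = \|\hat\alpha^{\beta}_{\hat m^{\beta}} - \alpha^{\beta_0}_{m}\|^2_{rand(\beta)} - 2\nu_n(\hat\alpha^{\beta}_{\hat m^{\beta}} - \alpha^{\beta_0}_{m}) + 2\langle \hat\alpha^{\beta}_{\hat m^{\beta}} - \alpha^{\beta_0}_{m}, \alpha^{\beta_0}_{m}\rangle_{rand(\beta)} - 2\langle \hat\alpha^{\beta}_{\hat m^{\beta}} - \alpha^{\beta_0}_{m}, \alpha_0\rangle_{rand}. \qquad (24)$$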

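Let us now introduce the following events:

$$\Delta_1 = \left\{\forall \alpha \in S_n\setminus\{0\} : \left|\frac{\|\alpha\|^2_{rand}}{\|\alpha\|^2_{det}} - 1\right| \le \frac{1}{2}\right\} \quad\text{and}\quad \Omega = \left\{\left|\frac{\hat f_0}{f_0} - 1\right| \le \frac{1}{3}\right\}, \qquad (25)$$

$$\Delta_2 = \left\{\forall \alpha \in S_n\setminus\{0\} : \left|\frac{\|\alpha\|^2_{rand(\hat\beta)}}{\|\alpha\|^2_{rand}} - 1\right| \le \frac{1}{2}\right\}. \qquad (26)$$

On the sets $\Delta_1$ and $\Delta_2$ we have a relation between the random norm $\|.\|_{rand}$ and the deterministic norm $\|.\|_{det}$,
and between the random norms $\|.\|_{rand}$ and $\|.\|_{rand(\hat\beta)}$, respectively. The following proposition
states a relation between the deterministic norm (9) and the standard $L^2$-norm: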


Proposition 6.1 (Connections between the norms). From Assumptions 2.2.(i)-(iii), we deduce the
following connection between the deterministic norm and the standard $L^2$-norm:

$$f_0\, e^{-B|\beta_0|_1}\,\|\alpha\|_2^2 \le \|\alpha\|^2_{det} \le \mathbb{E}[e^{\beta_0^T Z}]\,\|\alpha\|_2^2 \le e^{B|\beta_0|_1}\,\|\alpha\|_2^2.$$

The proof of this proposition is immediate using the fact that, from Assumption 2.2.(ii), we can
rewrite the deterministic norm as

$$\|\alpha\|^2_{det} = \int_0^\tau\!\!\int_A \alpha^2(t)\, e^{\beta_0^T z}\, \mathbb{E}[Y(t)\,|\,Z = z]\, f_Z(z)\,dz\,dt.$$


6.1.1 Results used in the proofs of Theorem 4.1 

Recall that for all (3 G M p , 



min Sp«) > max 


/ f 0 e~ B \Po\i e - B \Po-P\i 1 \| 

l 6 


The following lemma ensures the existence of the estimators $\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}}$ on $\Delta_1 \cap \Delta_2 \cap \Omega$.

Lemma 6.2. Under Assumptions 2.2.(i)-(iv), Assumption 3.1 and Assumptions 3.5.(i)-(iii), for
$n \ge 16/(f_0 e^{-3BR})^2$, the following embedding holds:

AinA 2 nfiC^nO, where := n 

m^Mn 

From this lemma, for all $m \in \mathcal{M}_n$, the matrix $G^{\hat\beta}_m$ is invertible on $\Delta_1 \cap \Delta_2 \cap \Omega$, and thus the
estimator of $\alpha_0$ is well defined. The proof of Lemma 6.2 is given in Subsection 6.3.1.
The following proposition bounds the quadratic difference between $\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}}$ and $\alpha^{\beta_0}_m$, for $m \in \mathcal{M}_n$, on
the complement of

$$\aleph_k = \Delta_1 \cap \Delta_2 \cap \Omega \cap \Omega^k_H,$$

where $\Omega^k_H$ (the index H is for "Huang", since the set has already been defined by Huang et al. (2013))
is defined for $k > 0$ by

$$\Omega^k_H = \left\{ |\hat\beta - \beta_0|_1 \le C(s)\sqrt{\frac{\log(pn^k)}{n}} \right\}, \qquad (27)$$

for a constant $C(s)$ depending on the sparsity index of $\beta_0$. From Proposition 3.2, $\mathbb{P}(\Omega^k_H) \ge 1 - cn^{-k}$
for a constant $c > 0$. Now, let us state the two following propositions.


Proposition 6.3. Under Assumptions 2.2.(i)-(iv), Assumption 3.1 and Assumptions 3.5.(i)-(iii),

$$\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\,\mathbf{1}_{\aleph_k^c}\big] \le \frac{c_1}{n}, \qquad (28)$$

where $c_1$ is a constant depending on $\tau$, $\phi$, $\|\alpha_0\|_{\infty,\tau}$, $f_0$, $\mathbb{E}[e^{\beta_0^T Z}]$, $\mathbb{E}[e^{2\beta_0^T Z}]$, $B$, $|\beta_0|_1$, the sparsity index
$s$ of $\beta_0$ and $\kappa_b$, a constant that comes from the Burkholder Inequality (see Theorem 6.9).


We refer to Subsection 6.3.2 for the proof of Proposition 6.3. This proposition is directly used in
the proof of Theorem 4.1 in Subsection 6.2.


Usually, in model selection (see for instance Massart (2007)), the penalty is obtained by using
Talagrand's deviation inequality for the maximum of empirical processes. In the empirical
process (23), the martingales $M_i$, $i = 1,\dots,n$, are unbounded, so we cannot directly use
Talagrand's inequality. We rely instead on the following proposition, proved in Comte et al. (2011). To
obtain a uniform deviation bound for $\nu_n(.)$, Comte et al. (2011) used tools from van de Geer (1995)
to establish Bennett and Bernstein type inequalities, together with an $L^2(det)$-$L^\infty$ generic chaining
technique (see Talagrand (2005) and Baraud (2010)).

Proposition 6.4. Let $m, m' \in \mathcal{M}_n$. Define

$$\mathcal{B}^{det}_{m,m'}(0,1) = \{\alpha \in S_m + S_{m'} : \|\alpha\|_{det} \le 1\}. \qquad (29)$$

Under the assumptions of Theorem 4.1, there exists $\kappa > 0$ such that for

$$p(m,m') = \frac{\kappa}{K_0}\big(pen(m) + pen(m')\big), \qquad (30)$$

where the constant $K_0$ and $pen(m)$ are defined in (16), then

$$\sum_{m'\in\mathcal{M}_n} \mathbb{E}\Big[\Big(\sup_{\alpha\in\mathcal{B}^{det}_{m,m'}(0,1)} \nu_n^2(\alpha) - p(m,m')\Big)_+ \mathbf{1}_{\Delta_1}\Big] \le \frac{C_3}{n}$$

for $n$ large enough, where $C_3$ is a constant depending on $f_0$, $\mathbb{E}[e^{\beta_0^T Z}]$, $B$, $|\beta_0|_1$, $\|\alpha_0\|_{\infty,\tau}$ and the choice
of the basis.


This proposition is applied to prove Theorem 4.1. We omit its proof and
refer to Comte et al. (2011) for a detailed proof of this result.
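We need Proposition 6.5 to prove Theorem 4.1: the empirical centered process $\eta_n(\alpha,\alpha^{\beta_0}_m)$, defined
by

$$\eta_n(\alpha,\alpha^{\beta_0}_m) = \frac{1}{n}\sum_{i=1}^{n}\big(U_i(\alpha,\alpha^{\beta_0}_m) - \mathbb{E}[U_i(\alpha,\alpha^{\beta_0}_m)]\big),$$

where

$$U_i(\alpha,\alpha^{\beta_0}_m) = \Big(\int_0^\tau \alpha(t)\,\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}\, Y_i(t)\,dt\Big)^2,$$

appears in the proof of Theorem 4.1, when we control the difference between the scalar products
$\langle .,.\rangle_{rand} - \langle .,.\rangle_{rand(\hat\beta)}$ (see Subsection 6.2.1). Proposition 6.5 allows us to control this process.

Proposition 6.5. Let us introduce the ball $\mathcal{B}^{det}_n(0,1) \subset S_n$ defined by

$$\mathcal{B}^{det}_n(0,1) = \{\alpha \in S_n : \|\alpha\|_{det} \le 1\}. \qquad (31)$$

Under Assumptions 2.2.(i)-(iv) and Assumption 3.1, we have

$$\mathbb{E}\Big[\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)} \eta_n(\alpha,\alpha^{\beta_0}_m)^2\Big] \le \frac{1}{n}\,\frac{\mathbb{E}[e^{4\beta_0^T Z}]\,\|\alpha^{\beta_0}_m\|_2^4}{(e^{-B|\beta_0|_1} f_0)^2}.$$

Proposition 6.5 is proved in Subsection 6.3.3.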

6.1.2 Technical lemmas for the proofs of Proposition ?? and 6.3 

In order to prove Proposition 6.3, we need three lemmas:

Lemma 6.6. Under Assumptions 2.2.(i)-(iv), Assumption 3.1 and Assumptions 3.5.(i)-(iii), we have

$$\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}}\|_2^4\big] \le C_b\, n^4,$$

where $C_b$ is a constant depending on $\|\alpha_0\|_{\infty,\tau}$, $\tau$, $\mathbb{E}[e^{\beta_0^T Z}]$ and $\mathbb{E}[e^{2\beta_0^T Z}]$, $\kappa_b$, the constant of the Burkholder
Inequality (see Theorem 6.9), and on the choice of the basis.

Lemma 6.7. Under Assumptions 2.2.(i)-(iv) and Assumptions 3.5.(i)-(iii), we have

$$\mathbb{P}(\Delta_1^c) \le \frac{C_k^{(\Delta_1)}}{n^k}, \quad \forall k \ge 1,$$

where $C_k^{(\Delta_1)}$ is a constant depending on $f_0$, $B$ and $|\beta_0|_1$.

Lemma 6.8. Under Assumptions 2.2.(i)-(iv), Assumption 3.1 and Assumption 3.3, we have for $n$
large enough,

$$\mathbb{P}(\Delta_2^c) \le \frac{C_k^{(\Delta_2)}}{n^k}, \quad \forall k \ge 1,$$

where the constant $C_k^{(\Delta_2)}$ depends on $\tau$, $\|\alpha_0\|_{\infty,\tau}$ and $\mathbb{E}[e^{\beta_0^T Z}]$.

These three lemmas are required to prove Proposition 6.3. They are proved in Subsection 6.3.
6.1.3 A classical inequality: the Burkholder Inequality

The last technical result is a Burkholder Inequality, which relates the norm of a martingale
to that of its optional quadratic variation process. We refer to Liptser and Shiryayev (1989), p. 75, for the proof of this result.

Theorem 6.9 (Burkholder Inequality). If $M = (M_t, \mathcal{F}_t)_{t\ge 0}$ is a martingale, then there are universal
constants $\gamma_b$ and $\kappa_b$ (independent of $M$) such that for every $t \ge 0$,

$$\gamma_b\,\big\|\sqrt{[M]_t}\big\|_2 \le \|M_t\|_2 \le \kappa_b\,\big\|\sqrt{[M]_t}\big\|_2,$$

where $[M]_t$ is the quadratic variation of $M_t$.

This theorem is used to prove Lemma 6.6, and the constants in the oracle inequalities of Theorem 4.1
depend on $\kappa_b$.


6.2 Proofs of the main theorems 
6.2.1 Proof of Theorem 4.1 

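In the following, we consider the sets $\Delta_1$, $\Delta_2$ and $\Omega$ defined by (25) and (26) and the set $\Omega^k_H$ defined
by (27). For the sake of simplicity in the notations, we denote by $\aleph_k$ the intersection of these four sets:
$\aleph_k = \Delta_1 \cap \Delta_2 \cap \Omega \cap \Omega^k_H$. We have the following decomposition:

$$\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha_0\|^2_{det}\big] \le 2\|\alpha_0 - \alpha^{\beta_0}_m\|^2_{det} + 2\,\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\aleph_k}\big] + 2\,\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\aleph_k^c}\big].$$

The first term is the usual bias term. From Proposition 6.3, we deduce that the last expectation is bounded
by $c_1/n$. We now focus on the term $\mathbb{E}[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\aleph_k}]$. From Lemma 6.2, for all $m \in \mathcal{M}_n$, the
matrices $G^{\hat\beta}_m$ are invertible on $\Delta_1 \cap \Delta_2 \cap \Omega \cap \Omega^k_H$ as soon as $n \ge 16/(f_0 e^{-3BR})^2$, and thus the estimator
$\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}}$ of $\alpha_0$ is well defined. From (22) and (24), with $\beta = \hat\beta$, we have for all $m \in \mathcal{M}_n$,

$$\begin{aligned}\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{rand(\hat\beta)} \le{}& 2\nu_n(\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m) + 2\langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha_0 - \alpha^{\beta_0}_m\rangle_{rand} + pen(m) - pen(\hat m^{\hat\beta}) \\ &+ 2\langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand} - 2\langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand(\hat\beta)},\end{aligned}$$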


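where the empirical process $\nu_n(.)$ is defined by Equation (23) and the random norm by (21). For
$\mathcal{B}^{det}_{m,m'}(0,1)$ defined by (29), using the classical inequality $2xy \le bx^2 + y^2/b$ with $b > 0$, we obtain

$$\begin{aligned}\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{rand(\hat\beta)} \le{}& \frac{1}{16}\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{rand} + 16\|\alpha_0 - \alpha^{\beta_0}_m\|^2_{rand} + pen(m) - pen(\hat m^{\hat\beta}) \\ &+ \frac{1}{16}\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det} + 16\sup_{\alpha\in\mathcal{B}^{det}_{m,\hat m^{\hat\beta}}(0,1)}\nu_n^2(\alpha) \\ &+ 2\big(\langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand} - \langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand(\hat\beta)}\big).\end{aligned}$$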
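Consequently, using the relations between the random norms $\|.\|_{rand(\hat\beta)}$ and $\|.\|_{rand}$, and between the
random norm $\|.\|_{rand}$ and the deterministic norm $\|.\|_{det}$ on $\aleph_k$, we obtain

$$\begin{aligned}\frac{1}{4}\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det} \le{}& \frac{3}{32}\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det} + 16\|\alpha_0 - \alpha^{\beta_0}_m\|^2_{rand} + pen(m) - pen(\hat m^{\hat\beta}) \\ &+ \frac{1}{16}\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det} + 16\sup_{\alpha\in\mathcal{B}^{det}_{m,\hat m^{\hat\beta}}(0,1)}\nu_n^2(\alpha) \\ &+ 2\big(\langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand} - \langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand(\hat\beta)}\big).\end{aligned}$$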

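This inequality can also be rewritten, with $p(m,m')$ defined by (30) for all $m' \in \mathcal{M}_n$, as

$$\begin{aligned}\frac{3}{32}\,\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\aleph_k}\big] \le{}& 16\|\alpha_0 - \alpha^{\beta_0}_m\|^2_{det} + 16\,p(m,\hat m^{\hat\beta}) + pen(m) - pen(\hat m^{\hat\beta}) \\ &+ 16\sum_{m'\in\mathcal{M}_n}\mathbb{E}\Big[\Big(\sup_{\alpha\in\mathcal{B}^{det}_{m,m'}(0,1)}\nu_n^2(\alpha) - p(m,m')\Big)_+\mathbf{1}_{\aleph_k}\Big] \\ &+ 2\,\mathbb{E}\Big[\big(\langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand} - \langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand(\hat\beta)}\big)\mathbf{1}_{\aleph_k}\Big].\end{aligned}$$

We fix $K_0 \ge 16\kappa$ such that $16\,p(m,m') \le pen(m) + pen(m')$ for all $m, m'$ in $\mathcal{M}_n$, so that

$$\begin{aligned}\frac{3}{32}\,\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\aleph_k}\big] \le{}& 16\|\alpha_0 - \alpha^{\beta_0}_m\|^2_{det} + 2\,pen(m) \\ &+ 16\sum_{m'\in\mathcal{M}_n}\mathbb{E}\Big[\Big(\sup_{\alpha\in\mathcal{B}^{det}_{m,m'}(0,1)}\nu_n^2(\alpha) - p(m,m')\Big)_+\mathbf{1}_{\aleph_k}\Big] \\ &+ 2\,\mathbb{E}\Big[\big(\langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand} - \langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand(\hat\beta)}\big)\mathbf{1}_{\aleph_k}\Big],\end{aligned}$$

that is

$$\frac{3}{32}\,\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\aleph_k}\big] \le 16\|\alpha_0 - \alpha^{\beta_0}_m\|^2_{det} + 2\,pen(m) + A(m) + \mathbb{E}\big[B(m,\hat m^{\hat\beta})\mathbf{1}_{\aleph_k}\big], \qquad (32)$$

where

$$A(m) = 16\sum_{m'\in\mathcal{M}_n}\mathbb{E}\Big[\Big(\sup_{\alpha\in\mathcal{B}^{det}_{m,m'}(0,1)}\nu_n^2(\alpha) - p(m,m')\Big)_+\Big], \qquad (33)$$

$$B(m,\hat m^{\hat\beta}) = 2\big(\langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand} - \langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand(\hat\beta)}\big). \qquad (34)$$

It remains to study the terms $A(m)$ and $B(m,\hat m^{\hat\beta})$.

Study of (33). According to Proposition 6.4, for $n$ large enough,

$$\sum_{m'\in\mathcal{M}_n}\mathbb{E}\Big[\Big(\sup_{\alpha\in\mathcal{B}^{det}_{m,m'}(0,1)}\nu_n^2(\alpha) - p(m,m')\Big)_+\Big] \le \frac{C_3}{n},$$

where $p(m,m')$ is defined by (30) and $C_3$ is a constant depending on $f_0$, $|\beta_0|_1$, $B$, $\mathbb{E}[e^{\beta_0^T Z}]$, $\|\alpha_0\|_{\infty,\tau}$
and the choice of the basis. Hence, for $C_3' = 16\,C_3$, we conclude that

$$A(m) \le \frac{C_3'}{n}. \qquad (35)$$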
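Study of (34). Using again the classical inequality $2xy \le bx^2 + y^2/b$ with $b > 0$, we obtain

$$\begin{aligned}&\langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand} - \langle \hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m, \alpha^{\beta_0}_m\rangle_{rand(\hat\beta)} \\ &\quad\le \frac{1}{32}\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det} + 32\sup_{\alpha\in\mathcal{B}^{det}_{m,\hat m^{\hat\beta}}(0,1)}\Big(\frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\big(e^{\beta_0^T Z_i} - e^{\hat\beta^T Z_i}\big)Y_i(t)\,dt\Big)^2.\end{aligned}$$

Now, from Assumption 3.5.(iii) and by definition (31) of $\mathcal{B}^{det}_n(0,1)$, we write that

$$\sup_{\alpha\in\mathcal{B}^{det}_{m,\hat m^{\hat\beta}}(0,1)}\Big(\frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\big(e^{\beta_0^T Z_i} - e^{\hat\beta^T Z_i}\big)Y_i(t)\,dt\Big)^2$$

is less than

$$\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\Big(\frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}\big(1 - e^{\hat\beta^T Z_i - \beta_0^T Z_i}\big)Y_i(t)\,dt\Big)^2. \qquad (36)$$

We have

$$\Big|\frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}\big(1 - e^{\hat\beta^T Z_i - \beta_0^T Z_i}\big)Y_i(t)\,dt\Big| \le \frac{1}{n}\sum_{i=1}^{n}\big|1 - e^{\hat\beta^T Z_i - \beta_0^T Z_i}\big|\,\Big|\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}Y_i(t)\,dt\Big|.$$

Using the fact that $|e^x - e^y| \le |x - y|\,e^{x\vee y}$ for all $(x,y) \in \mathbb{R}^2$ and applying Assumption 2.2.(i) and
Assumption 3.1, we obtain that

$$\begin{aligned}\Big|\frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}\big(1 - e^{\hat\beta^T Z_i - \beta_0^T Z_i}\big)Y_i(t)\,dt\Big| &\le \frac{1}{n}\sum_{i=1}^{n}\big|\hat\beta^T Z_i - \beta_0^T Z_i\big|\, e^{(\hat\beta^T Z_i - \beta_0^T Z_i)\vee 0}\,\Big|\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}Y_i(t)\,dt\Big| \\ &\le B\,e^{2BR}\,|\hat\beta - \beta_0|_1\,\frac{1}{n}\sum_{i=1}^{n}\Big|\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}Y_i(t)\,dt\Big|.\end{aligned}$$

Now, write

$$\begin{aligned}&\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\Big(\frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}\big(1 - e^{\hat\beta^T Z_i - \beta_0^T Z_i}\big)Y_i(t)\,dt\Big)^2 \\ &\quad\le B^2 e^{4BR}\,|\hat\beta - \beta_0|_1^2 \sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\frac{1}{n}\sum_{i=1}^{n}\Big(\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}Y_i(t)\,dt\Big)^2 \\ &\quad\le B^2 e^{4BR}\,|\hat\beta - \beta_0|_1^2 \sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\big\{\eta_n(\alpha,\alpha^{\beta_0}_m) + D_n(\alpha,\alpha^{\beta_0}_m)\big\}, \qquad (37)\end{aligned}$$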
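where $\eta_n(\alpha,\alpha^{\beta_0}_m)$ is defined by

$$\eta_n(\alpha,\alpha^{\beta_0}_m) = \frac{1}{n}\sum_{i=1}^{n}\Big[\Big(\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}Y_i(t)\,dt\Big)^2 - \mathbb{E}\Big[\Big(\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z_i}Y_i(t)\,dt\Big)^2\Big]\Big]$$

and

$$D_n(\alpha,\alpha^{\beta_0}_m) = \mathbb{E}\Big[\Big(\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z}Y(t)\,dt\Big)^2\Big].$$

We first claim that the term $\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\{D_n(\alpha,\alpha^{\beta_0}_m)\}$ is bounded, by using that, from the Cauchy-Schwarz
Inequality,

$$\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\mathbb{E}\Big[\Big(\int_0^\tau \alpha(t)\alpha^{\beta_0}_m(t)\, e^{\beta_0^T Z}Y(t)\,dt\Big)^2\Big] \le \|\alpha^{\beta_0}_m\|^2_{det}.$$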


Thus, gathering bounds (36) and (37, we obtain that 


B{m,rhP) < ^||a^ - a&°||L + 64 


B 2 e 4BR \$-Po\i( sup {%(a,a& 0 )} + ||a^°||L 
aeB^ et (0,l) 


So, taking the expectation and applying Proposition 6.5 to control 

E [ su PaeS^(o,i )(Vn{a,a^)) 2 }, 

we get 


E[B(m,mP)tn k ] < ^E[||d^ -a^llL^J 


+64B 2 e 4BR ^K 1 / 2 [\$ - ^olfluJE 1 / 2 


sup {rj n (a,a%>)} 
a&B% ei ( 0,1) 


+ \\°&\\defi[0 ~ Po\l*x k ] • (38) 


Finally, combining (32), (35) and (38) we conclude that 


-^°IIL1kJ < 16||a 0 - a m\ldet + 2pen(m) + ^ 

+ 64B 2 e 4Bi? ||a^||LE[|^-^o|?lKj 

+ 645 2 e 4BiJ E 1 / 2 [|^ - 


n 


On Qr\£ljj, using that, from definition (15) and Proposition 6.1, ||a ^°|\ 2 det < 2| |«o |\det < E[e^° Tz ]r| |ao||oo,Tj 
we have 

Q4B 2 e 4BR ^^° u2 




and that 


fi .n2 4 BK f 1/2[|3 a |4 -11 , E 1 / 2 [e 4/3 ° Z ]| |a^° 11| 1 

64Be E'^-zSol^J - e _ B |»,| 1/o -^ 

< C(», B, |/3oli,B.E[e' , » z ],E[e 4 ' s " z ], ||aolU,x, r, 


ri\ n 
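where $s$ is the sparsity index of $\beta_0$ and

$C(s, B, R, \mathbb{E}[e^{\beta_0^T Z}], \|\alpha_0\|_{\infty,\tau}, \tau)$ and $C(s, B, |\beta_0|_1, R, \mathbb{E}[e^{\beta_0^T Z}], \mathbb{E}[e^{4\beta_0^T Z}], \|\alpha_0\|_{\infty,\tau}, \tau, f_0)$

are constants depending on the elements in brackets. Combining the previous bounds with Proposition
6.3, we conclude that Theorem 4.1 is proved since

$$\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha_0\|^2_{det}\big] \le \kappa_0\inf_{m\in\mathcal{M}_n}\big\{\|\alpha_0 - \alpha^{\beta_0}_m\|^2_{det} + 2\,pen(m)\big\} + \frac{C_1}{n} + C_2\frac{\log(np)}{n},$$

where $C_1$ and $C_2$ are constants depending on the sparsity index $s$ of $\beta_0$, $B$, $|\beta_0|_1$, $\mathbb{E}[e^{\beta_0^T Z}]$, $\mathbb{E}[e^{4\beta_0^T Z}]$, $\|\alpha_0\|_{\infty,\tau}$,
$\tau$ and $f_0$.

□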

6.2.2 Proof of Corollary 4.2 

From Proposition 6.1 and the proof of Corollary 1 in Comte et al. (2011), we deduce that 


K[||<, - aolli] < — — E [ll «*,/3 - «oM let] < \ Dj 1 + ^ \ + C 2 (s) 


fo 


m&Mr, 


Dn 

n 


log (np) 


n 


and since 


D r 


27 


inf < Z2 m 27 H-— > = n 2 t ,+1 , 


meAt 


n 


we finally get the corollary. 


□ 


6.3 Proofs of the technical propositions and lemmas 
6.3.1 Proof of Lemma 6.2 

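Let $m \in \mathcal{M}_n$ be fixed and let $v$ be an eigenvalue of $G^{\hat\beta}_m$. There exists $A_m \ne 0$ with coefficients $(a_j)_j$
such that $G^{\hat\beta}_m A_m = vA_m$ and thus $A_m^T G^{\hat\beta}_m A_m = vA_m^T A_m$. Now, take $h := \sum_j a_j\varphi_j \in S_m$. We have
$\|h\|^2_{rand(\hat\beta)} = A_m^T G^{\hat\beta}_m A_m$ and $\|h\|_2^2 = A_m^T A_m$. Thus, on $\Delta_1 \cap \Delta_2$ defined in (25) and (26) and from
Proposition 6.1:

$$\|h\|^2_{rand(\hat\beta)} \ge \frac{1}{2}\|h\|^2_{rand} \ge \frac{1}{4}\|h\|^2_{det} \ge \frac{f_0\,e^{-B|\beta_0|_1}}{4}\|h\|_2^2.$$

Therefore, on $\Delta_1 \cap \Delta_2$, for all $m \in \mathcal{M}_n$, we have $\min \mathrm{Sp}(G^{\hat\beta}_m) \ge f_0\,e^{-B|\beta_0|_1}/4$. Moreover, on $\Omega$, we have
$\hat f_0 \ge 2f_0/3$ and $\max(f_0 e^{-3BR}/6,\, n^{-1/2}) = f_0 e^{-3BR}/6$ for $n \ge 36/(f_0 e^{-3BR})^2$, which is equivalent on
$\Omega$ to choosing $n \ge 16/(f_0 e^{-3BR})^2$. □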
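The identity $\|h\|^2_{rand(\beta)} = A_m^T G_m^{\beta} A_m$ and the resulting eigenvalue bound can be checked numerically. The sketch below is only an illustration under our own assumptions: a histogram basis $\varphi_j = \sqrt{D/\tau}\,\mathbf{1}_{[j\tau/D,(j+1)\tau/D)}$, an at-risk process $Y_i(t) = \mathbf{1}\{X_i \ge t\}$, and hypothetical function names; it is not the implementation used in the paper.

```python
import numpy as np

def histogram_gram(beta, Z, X, tau, D):
    """Gram matrix G[j, k] = (1/n) sum_i int_0^tau phi_j phi_k exp(beta^T Z_i) Y_i(t) dt
    for the histogram basis on [0, tau] with D bins (assumed basis)."""
    Z = np.asarray(Z, dtype=float)
    X = np.asarray(X, dtype=float)
    w = np.exp(Z @ np.asarray(beta, dtype=float))       # exp(beta^T Z_i)
    left_edges = np.linspace(0.0, tau, D + 1)[:-1]      # left end of each bin
    # Length of bin j intersected with [0, X_i], i.e. int 1{bin j}(t) Y_i(t) dt.
    overlap = np.clip(X[:, None] - left_edges[None, :], 0.0, tau / D)
    diag = (D / tau) * (w[:, None] * overlap).mean(axis=0)
    return np.diag(diag)  # disjoint supports, so the matrix is diagonal

def rand_norm_sq_quadrature(a, beta, Z, X, tau, n_cells=100_000):
    """Midpoint-rule approximation of ||h||^2_{rand(beta)} for h = sum_j a_j phi_j."""
    a = np.asarray(a, dtype=float)
    Z = np.asarray(Z, dtype=float)
    X = np.asarray(X, dtype=float)
    D = a.size
    w = np.exp(Z @ np.asarray(beta, dtype=float))
    t = (np.arange(n_cells) + 0.5) * tau / n_cells      # cell midpoints
    bins = np.minimum((t * D / tau).astype(int), D - 1)
    h_sq = (D / tau) * a[bins] ** 2                     # h(t)^2 on the grid
    at_risk = (w[:, None] * (X[:, None] >= t[None, :])).mean(axis=0)
    return float(np.sum(h_sq * at_risk) * tau / n_cells)
```

With `G = histogram_gram(...)`, the quadratic form `a @ G @ a` agrees with the quadrature value up to discretization error, and it is bounded below by `np.linalg.eigvalsh(G).min() * (a @ a)`, which is the Rayleigh-quotient argument used in the proof.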


6.3.2 Proof of Proposition 6.3 

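We have the following decomposition:

$$\begin{aligned}\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\aleph_k^c}\big] \le{}& \mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\Delta_1^c}\big] + \mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\Delta_2^c}\big] \\ &+ \mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\Omega^c}\big] + \mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{(\Omega^k_H)^c}\big].\end{aligned}$$

We deduce that

$$\begin{aligned}\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\aleph_k^c}\big] \le 2\Big(&\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha_0\|^2_{det}\mathbf{1}_{\Delta_1^c}\big] + \mathbb{E}\big[\|\alpha^{\beta_0}_m - \alpha_0\|^2_{det}\mathbf{1}_{\Delta_1^c}\big] \\ &+ \mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha_0\|^2_{det}\mathbf{1}_{\Delta_2^c}\big] + \mathbb{E}\big[\|\alpha^{\beta_0}_m - \alpha_0\|^2_{det}\mathbf{1}_{\Delta_2^c}\big] \\ &+ \mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha_0\|^2_{det}\mathbf{1}_{\Omega^c}\big] + \mathbb{E}\big[\|\alpha^{\beta_0}_m - \alpha_0\|^2_{det}\mathbf{1}_{\Omega^c}\big] \\ &+ \mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha_0\|^2_{det}\mathbf{1}_{(\Omega^k_H)^c}\big] + \mathbb{E}\big[\|\alpha^{\beta_0}_m - \alpha_0\|^2_{det}\mathbf{1}_{(\Omega^k_H)^c}\big]\Big).\end{aligned}$$

From definition (15) of $\alpha^{\beta_0}_m$ and Proposition 6.1, we have $\|\alpha^{\beta_0}_m - \alpha_0\|^2_{det} \le \|\alpha_0\|^2_{det} \le \mathbb{E}[e^{\beta_0^T Z}]\,\|\alpha_0\|_2^2$.
From this relation and using the Cauchy-Schwarz Inequality, we have

$$\begin{aligned}\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}} - \alpha^{\beta_0}_m\|^2_{det}\mathbf{1}_{\aleph_k^c}\big] \le 4\,\mathbb{E}[e^{\beta_0^T Z}]\Big[&\mathbb{E}^{1/2}\big(\|\hat\alpha^{\hat\beta}_{\hat m^{\hat\beta}}\|_2^4\big)\Big(\mathbb{P}^{1/2}(\Delta_1^c) + \mathbb{P}^{1/2}(\Delta_2^c) + \mathbb{P}^{1/2}(\Omega^c) + \mathbb{P}^{1/2}((\Omega^k_H)^c)\Big) \\ &+ \|\alpha_0\|_2^2\Big(\mathbb{P}(\Delta_1^c) + \mathbb{P}(\Delta_2^c) + \mathbb{P}(\Omega^c) + \mathbb{P}((\Omega^k_H)^c)\Big)\Big].\end{aligned}$$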

From Assumption 3.4, Proposition 3.2, Lemmas 6.6, 6.7 and 6.8 with k = 6, we conclude that 


I«0||2- 


- o&°llL%] < 2E[e^o3] 


a 


(All 


n 


6 


■ + 


Ic^ 

,6 


n 


+ \l^ + ^ 


. , l2 fC { e Al) C (A2) C 0 c s 


n° 


n 


□ 


which ends the proof of Proposition 6.3. 

6.3.3 Proof of Proposition 6.5 

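The proof is inspired by the paper of Brunel et al. (2010). If we denote by $(\varphi_j)_{j\in\mathcal{K}_n}$ the orthonormal
basis of the global nesting space $S_n$ (see Assumption 3.5.(iii)), since $\alpha$ belongs to $\mathcal{B}^{det}_n(0,1) \subset S_n$, we
can write $\alpha(t) = \sum_{j\in\mathcal{K}_n} a_j\varphi_j(t)$, with $\dim S_n = \mathcal{D}_n = |\mathcal{K}_n|$. With this definition, we obtain

$$\eta_n(\alpha,\alpha^{\beta_0}_m) = \sum_{j,j'} a_j a_{j'}\,\frac{1}{n}\sum_{i=1}^{n}\Big(\int_0^\tau \varphi_j(t)\alpha^{\beta_0}_m(t)e^{\beta_0^T Z_i}Y_i(t)\,dt\int_0^\tau \varphi_{j'}(t)\alpha^{\beta_0}_m(t)e^{\beta_0^T Z_i}Y_i(t)\,dt - \mathbb{E}\Big[\int_0^\tau \varphi_j(t)\alpha^{\beta_0}_m(t)e^{\beta_0^T Z_i}Y_i(t)\,dt\int_0^\tau \varphi_{j'}(t)\alpha^{\beta_0}_m(t)e^{\beta_0^T Z_i}Y_i(t)\,dt\Big]\Big).$$

For the sake of simplicity, we introduce the notation

$$\Lambda^{i}_{j,j'} = \int_0^\tau \varphi_j(t)\alpha^{\beta_0}_m(t)e^{\beta_0^T Z_i}Y_i(t)\,dt\int_0^\tau \varphi_{j'}(t)\alpha^{\beta_0}_m(t)e^{\beta_0^T Z_i}Y_i(t)\,dt.$$

Applying the Cauchy-Schwarz Inequality, we get

$$|\eta_n(\alpha,\alpha^{\beta_0}_m)| \le \sqrt{\sum_{j,j'} a_j^2 a_{j'}^2}\;\sqrt{\sum_{j,j'}\Big(\frac{1}{n}\sum_{i=1}^{n}\big(\Lambda^{i}_{j,j'} - \mathbb{E}[\Lambda^{i}_{j,j'}]\big)\Big)^2}.$$

From Proposition 6.1, we have

$$\begin{aligned}\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\eta_n(\alpha,\alpha^{\beta_0}_m)^2 &\le \sup_{(a_j):\,\sum_j a_j^2 \le (e^{-B|\beta_0|_1}f_0)^{-1}}\;\sum_{j,j'} a_j^2 a_{j'}^2\,\sum_{j,j'}\Big(\frac{1}{n}\sum_{i=1}^{n}\big(\Lambda^{i}_{j,j'} - \mathbb{E}[\Lambda^{i}_{j,j'}]\big)\Big)^2 \\ &\le \frac{1}{(e^{-B|\beta_0|_1}f_0)^2}\sum_{j,j'}\Big(\frac{1}{n}\sum_{i=1}^{n}\big(\Lambda^{i}_{j,j'} - \mathbb{E}[\Lambda^{i}_{j,j'}]\big)\Big)^2.\end{aligned}$$

Taking the expectation, it follows that

$$\begin{aligned}\mathbb{E}\Big[\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\eta_n(\alpha,\alpha^{\beta_0}_m)^2\Big] &\le \frac{1}{(e^{-B|\beta_0|_1}f_0)^2}\sum_{j,j'}\mathbb{E}\Big[\Big(\frac{1}{n}\sum_{i=1}^{n}\big(\Lambda^{i}_{j,j'} - \mathbb{E}[\Lambda^{i}_{j,j'}]\big)\Big)^2\Big] \\ &\le \frac{1}{(e^{-B|\beta_0|_1}f_0)^2}\,\frac{1}{n}\sum_{j,j'}\mathbb{E}\big[(\Lambda^{1}_{j,j'})^2\big].\end{aligned}$$

Thus, from the definition of $\Lambda^{i}_{j,j'}$, we obtain that $\mathbb{E}[\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\eta_n(\alpha,\alpha^{\beta_0}_m)^2]$ is less than

$$\frac{1}{(e^{-B|\beta_0|_1}f_0)^2}\,\frac{1}{n}\sum_{j,j'}\mathbb{E}\Big[\Big(\int_0^\tau \varphi_j(t)\alpha^{\beta_0}_m(t)e^{\beta_0^T Z}Y(t)\,dt\Big)^2\Big(\int_0^\tau \varphi_{j'}(t)\alpha^{\beta_0}_m(t)e^{\beta_0^T Z}Y(t)\,dt\Big)^2\Big].$$

From Brunel et al. (2010) p.301, Equation (2.7), we have

$$\sum_{j\in\mathcal{K}_n}\Big(\int_0^\tau \varphi_j(t)\alpha^{\beta_0}_m(t)e^{\beta_0^T Z}Y(t)\,dt\Big)^2 \le \int_0^\tau\big(\alpha^{\beta_0}_m(t)e^{\beta_0^T Z}Y(t)\big)^2dt \le e^{2\beta_0^T Z}\,\|\alpha^{\beta_0}_m\|_2^2.$$

From this inequality, we obtain

$$\mathbb{E}\Big[\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}\eta_n(\alpha,\alpha^{\beta_0}_m)^2\Big] \le \frac{\mathbb{E}[e^{4\beta_0^T Z}]\,\|\alpha^{\beta_0}_m\|_2^4}{(e^{-B|\beta_0|_1}f_0)^2}\,\frac{1}{n}.$$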


6.3.4 Proof of Lemma 6.6 


From Assumption 3.1, we recall that |/3 — /3 q|i < 2i?. On 77^, we have 


'<.111= E («?T = iia 

, Q 

m" 


^lli = ll(G^)- 1 r^||I 


<(minSp(G^))- 2 ||r ^|| 2 

/ 36 

< mm 


_ ’ n 5Z :E / <Pj(t)dNi(t) 


f2 e - 2 B|^oli-2B|/9 0 -^|i / ./r- V n ^ J 0 

'•'U / JfcJ „ ,3 \ 1 


< min 


36 


J2 e -2B IA|1 -4B R '") »E E 


<Pj(t)dNi(t) 
So we have


\a‘ 


rhP 


i 1 Q 


allz <^ n E E (/ <Pj(t)dNi{t)) <n 2 n E <Pj(t)dNi(t) 


i =i \je.JCn 


where /C n is a set of indices of the global nesting space S n , defined in Assumption 3.5.(iii), and 
dim S n = V n = |/C n |. Thus, we deduce that 


-n -t 71 / \ 4 

|d^||2<n 2 2? n -^^ ( l n <pj(t)dNi(t) 

i= 1 jeKn 


'o 


Now, 


E 


T) 

i=ljeKr, 


uE E ( / <Pj(t)dNi(t) 


o3 ™ 

4 eE e 

i=i je/c„ 

o3 ™ 

+iee e 

2=1 JG/C n 


^■(f)dMj(i) 


(t) a 0 (t) e^o Zi Tj (t) dt 


Using the Burkholder Inequality (see Liptser and Shiryayev (1989)), we get 


E 


i =i 


“E E U <Pj(t)dMi(t) 


^ \EE E 


*=i je/c^ 


- Kb n^ E E 


*=i je/c^ 


ip 2 At)dNi(t) 


N i(r) <^( s ) 

s: AA^t^O 


< Kb — y e 
n r—f 


2=1 


^(u) E E ^(«) 

s-.AN^O j&K n 


which is finally bounded from Assumption 3.5.(ii) by 


E 


n 


i= 1 j&fCn 


EE / <Pj{t)dMi(t) 


<^E E 

n z ^ 


n . , 

i=i 
at (^_\2 


IV^r) y 1 

s:AA^j7^0 


< KbFKWhij) 


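Then, we can write that

$$[N_1(\tau)]^2 = \Big(M_1(\tau) + \int_0^\tau \alpha_0(t)e^{\beta_0^T Z}Y(t)\,dt\Big)^2 \le 2(M_1(\tau))^2 + 2\Big(\int_0^\tau \alpha_0(t)e^{\beta_0^T Z}Y(t)\,dt\Big)^2,$$

and

$$\mathbb{E}[(M_1(\tau))^2] \le \mathbb{E}\Big[\int_0^\tau \alpha_0(t)e^{\beta_0^T Z}Y(t)\,dt\Big] \le \tau\,\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{\beta_0^T Z}],$$

so that

$$\mathbb{E}[(N_1(\tau))^2] \le 2\|\alpha_0\|_{\infty,\tau}\,\tau\,\mathbb{E}[e^{\beta_0^T Z}] + 2\|\alpha_0\|^2_{\infty,\tau}\,(\mathbb{E}[e^{\beta_0^T Z}])^2\,\tau^2.$$

So, by using the Cauchy-Schwarz Inequality, we obtain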

4-, 

1 _ u _ __ I r T \ 

E 


iyy (««(.) 

L n i=l j£K. n \ J0 ) 


<8^ 2 P 2 E[(IV 1 (t)) 2 ] + 8 Y, E 

j&ICn 


ifj(t)ao(t)e^° z Y(t)dt 


< 8 Kfe 0 2 P 2 E[(IV^t)) 2 ] + 8| |a 0 | 1^, E[e 4/3 o z ]r 2 P ri . 


2 r 2 


Eventually, under Assumption 3.5.(i), we get 

E[||6^||£] < n 2 V n |^8k6(/) 2 D 2 (211a 0 11oor tE[ e^° z ] + 211a 0 11L ,r (E [e^o z ]) 

+ 8||ao||^o,r IE [e 4 ^ Z ]T 2 ^n 

< C b n 2 Vl 

< C b n 4 , 

where C b is a constant that depends on K b . ||cko|Ioo.tj A E[e^° z ] an d E[e 4 ^o z ] and on the choice of 
the basis. □ 


6.3.5 Proof of Lemma 6.7 

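The event $\Delta_1$ defined by (25) can be rewritten as

$$\Delta_1 = \left\{\omega \in \Omega,\ \forall \alpha \in S_n\setminus\{0\} : \left|\frac{\|\alpha\|^2_{rand(\omega)}}{\|\alpha\|^2_{det}} - 1\right| \le \frac{1}{2}\right\},$$

and consider

$$\vartheta_n(\alpha) = \frac{1}{n}\sum_{i=1}^{n}\int_0^\tau\big(\alpha(t)e^{\beta_0^T Z_i}Y_i(t) - \mathbb{E}[\alpha(t)e^{\beta_0^T Z_i}Y_i(t)]\big)dt = \|\sqrt{\alpha}\|^2_{rand} - \|\sqrt{\alpha}\|^2_{det}. \qquad (39)$$

If $\omega \in (\Delta_1)^c$, then there exists $\alpha$ (which can depend on $\omega$) such that

$$\left|\frac{\|\alpha\|^2_{rand(\omega)}}{\|\alpha\|^2_{det}} - 1\right| > \frac{1}{2}.$$

Taking $\gamma = \alpha/\|\alpha\|_{det}$, we have that

$$\gamma \in S_n\setminus\{0\}, \quad \|\gamma\|_{det} = 1 \quad\text{and}\quad \big|\|\gamma\|^2_{rand(\omega)} - 1\big| > \frac{1}{2}.$$

So, if $\omega \in (\Delta_1)^c$, then

$$\omega \in \left\{\omega \in \Omega : \sup_{\gamma\in S_n\setminus\{0\},\ \|\gamma\|_{det}=1}\big|\|\gamma\|^2_{rand(\omega)} - 1\big| > \frac{1}{2}\right\}.$$

From this, we deduce that

$$\mathbb{P}((\Delta_1)^c) \le \mathbb{P}\Big(\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}|\vartheta_n(\alpha^2)| > \frac{1}{2}\Big),$$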



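where $\mathcal{B}^{det}_n(0,1)$ is defined by (31). Since $\alpha \in \mathcal{B}^{det}_n(0,1) \subset S_n$, we can write $\alpha(t) = \sum_{j\in\mathcal{K}_n} a_j\varphi_j(t)$,
where $\mathcal{K}_n$ is a set of indices of $S_n$ and $\dim S_n = \mathcal{D}_n = |\mathcal{K}_n|$. With this notation, we have

$$\vartheta_n(\alpha^2) = \sum_{j,k} a_j a_k\,\vartheta_n(\varphi_j\varphi_k).$$

From Proposition 6.1, we have

$$\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}|\vartheta_n(\alpha^2)| \le \frac{1}{f_0\,e^{-B|\beta_0|_1}}\sup_{(a_j)_{j\in\mathcal{K}_n}:\ \sum_j a_j^2\le 1}\Big|\sum_{j,k} a_j a_k\,\vartheta_n(\varphi_j\varphi_k)\Big|.$$

Let us consider the process defined by

$$U_i^{(j,k)} = \int_0^\tau \varphi_j(t)\varphi_k(t)\,e^{\beta_0^T Z_i}\,Y_i(t)\,dt.$$

We have $|U_i^{(j,k)}| \le e^{B|\beta_0|_1}$ and, from the Cauchy-Schwarz Inequality, we have

$$(U_i^{(j,k)})^2 \le e^{2B|\beta_0|_1}\int_0^\tau \varphi_j^2(t)\,dt\int_0^\tau \varphi_k^2(t)\,dt \le e^{2B|\beta_0|_1}.$$

We can apply the standard Bernstein Inequality (see Massart (2007)) to the process $(U_i^{(j,k)})$, and we
obtain

$$\mathbb{P}\Big(|\vartheta_n(\varphi_j\varphi_k)| \ge e^{B|\beta_0|_1}x + \sqrt{2e^{2B|\beta_0|_1}x}\Big) \le 2e^{-nx}. \qquad (40)$$

Let us introduce

$$\Theta := \Big\{\forall j,k,\ |\vartheta_n(\varphi_j\varphi_k)| \le e^{B|\beta_0|_1}x + e^{B|\beta_0|_1}\sqrt{2x}\Big\} \quad\text{and}\quad x = \frac{f_0^2\,e^{-2B|\beta_0|_1}}{16\,\mathcal{D}_n^2\,e^{2B|\beta_0|_1}}.$$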

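On $\Theta$, we can write that $\sup_{\alpha\in\mathcal{B}^{det}_n(0,1)}|\vartheta_n(\alpha^2)|$ is less than

$$\begin{aligned}&\frac{1}{f_0\,e^{-B|\beta_0|_1}}\sup_{(a_j):\ \sum_j a_j^2\le 1}\ \sum_{j,k}|a_j||a_k|\Big(e^{B|\beta_0|_1}x + e^{B|\beta_0|_1}\sqrt{2x}\Big) \\ &\quad\le \frac{1}{f_0\,e^{-B|\beta_0|_1}}\sup_{(a_j):\ \sum_j a_j^2\le 1}\Big(\sum_j |a_j|\Big)^2\Big(e^{B|\beta_0|_1}x + e^{B|\beta_0|_1}\sqrt{2x}\Big),\end{aligned}$$

which is less than

$$\begin{aligned}\frac{1}{f_0\,e^{-B|\beta_0|_1}}\,\mathcal{D}_n\Big(\frac{e^{B|\beta_0|_1}f_0^2\,e^{-2B|\beta_0|_1}}{16\,\mathcal{D}_n^2\,e^{2B|\beta_0|_1}} + \frac{e^{B|\beta_0|_1}\sqrt{2}\,f_0\,e^{-B|\beta_0|_1}}{4\,\mathcal{D}_n\,e^{B|\beta_0|_1}}\Big) &\le \frac{1}{2}\Big(\frac{f_0}{8\,e^{2B|\beta_0|_1}\mathcal{D}_n} + \frac{1}{\sqrt{2}}\Big) \\ &\le \frac{1}{2}\Big(\frac{1}{4} + \frac{1}{\sqrt{2}}\Big) \le \frac{1}{2}. \qquad (41)\end{aligned}$$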
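The arithmetic behind Inequality (41) can be verified numerically. The following sketch evaluates the left-hand side with the choice of $x$ made above and compares it with its closed form $f_0 e^{-2B|\beta_0|_1}/(16\mathcal{D}_n) + \sqrt{2}/4$. The function name and the argument `b`, which stands for $B|\beta_0|_1$, are hypothetical, and the check assumes $f_0 \le 1$, $b \ge 0$ and $\mathcal{D}_n \ge 1$.

```python
import math

def bound_41(f0, b, D):
    """Return (left-hand side of (41), closed form) for b = B*|beta_0|_1 and D = D_n."""
    # Choice of x made in the definition of the event Theta.
    x = f0**2 * math.exp(-2 * b) / (16 * D**2 * math.exp(2 * b))
    # D_n / (f0 e^{-b}) * (e^{b} x + e^{b} sqrt(2x)), using (sum_j |a_j|)^2 <= D_n.
    lhs = D / (f0 * math.exp(-b)) * (math.exp(b) * x + math.exp(b) * math.sqrt(2 * x))
    # The same quantity after simplification.
    closed_form = f0 * math.exp(-2 * b) / (16 * D) + math.sqrt(2) / 4
    return lhs, closed_form
```

Under the stated assumptions the first term of the closed form is at most $1/16$, so the sum stays below $1/2$, which is what the proof requires to conclude that $\Theta \subset \Delta_1$.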
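From Inequality (41), we deduce that $\mathbb{P}((\Delta_1)^c) \le \mathbb{P}(\Theta^c)$. So, using Inequality (40), we conclude
that

$$\begin{aligned}\mathbb{P}((\Delta_1)^c) &\le \sum_{j,k}\mathbb{P}\Big(|\vartheta_n(\varphi_j\varphi_k)| > e^{B|\beta_0|_1}x + e^{B|\beta_0|_1}\sqrt{2x}\Big) \\ &\le 2\mathcal{D}_n^2\exp\Big(-\frac{n\,f_0^2\,e^{-2B|\beta_0|_1}}{16\,\mathcal{D}_n^2\,e^{2B|\beta_0|_1}}\Big) \\ &\le 2n\exp\Big(-\frac{f_0^2}{16\,e^{4B|\beta_0|_1}}\,\frac{n}{\mathcal{D}_n^2}\Big) \\ &\le 2n\exp\Big(-\frac{f_0^2}{16\,e^{4B|\beta_0|_1}}\,\log^2 n\Big) \le \frac{C_k^{(\Delta_1)}}{n^k}, \quad \forall k \ge 1,\end{aligned}$$

as $\mathcal{D}_n \le \sqrt{n}/\log n$ from Assumption 3.5.(iii), which ends the proof of Lemma 6.7 with $C_k^{(\Delta_1)}$ a constant
depending on $\rho_1$, $f_0$, $B$ and $|\beta_0|_1$. □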


6.3.6 Proof of Lemma 6.8 

For p 2 > 1, let define 


Af = { Vo e S ri 


\a\ 


rand(j3) 


- 1 


cv “ 

III rand 


< l- 

P2 


Let consider 


Mat) = -J2 L {ot{t)eP TZi Yi(t) - a{t)e^ Zi Yi{t))dt = Hv^llJ ond(j9) - Hv^llmnd 


n —Jo 


Following the same approach as in the proof of Lemma 6.7, we have 


P((A£ 2 ) c )<pf sup K(a 2 )|>l--V 
\aeB£ et (0,l) P 2 / 

where 1) = {a G 5 n : Halldet < 1}. The process $ n (a 2 ) is bounded by 

, Rp^l^ollp 2 ®^ 

K(a 2 )| < Be B \P°^e 2BR \p - /3 0 |i|M|| < |/3 - 


So we get 


sup \tin{a 2 )\ < \$ - /3o|i 


Se 2B|/3 0 |i e 2SiJ 

To 


(42) 


From Proposition 3.2, we have, with probability larger than $1 - cn^{-k}$,

$$|\hat\beta - \beta_0|_1 \le C(s)\sqrt{\frac{\log(pn^k)}{n}}.$$

Then we have, with probability larger than $1 - cn^{-k}$,


sup \Ma 2 )\ < C(s)i 


llog(pn k ) Be 2B \P°\ 1 e 2BR 


a&Bf* ( 0 , 1 ) 


n 


fo 


Thus, by taking 1 — l/p 2 = C(s) 


llog(pn k ) Be 2B \P°\ 1 e 2BR 


n 


fo 


in (42), we obtain 


P((A£ 2 ) C ) < cn~ k . 


From Assumption 3.3, we deduce that for n large enough, 


so that A 2 defined by (26) verifies P((A 2 ) C ) < P((A 2 2 ) C ) < cj^^n k , with C^ 2 ' = c > 0. 


ffl 


A Prediction result on the Lasso estimator $\hat\beta$ of $\beta_0$ for unbounded
counting processes

To obtain a non-asymptotic prediction bound on the Lasso estimator $\hat\beta$ of the regression parameter
in the Cox model, we rely on Theorem 3.1 of Huang et al. (2013), that we recall here.

Let us consider the classical Lasso estimator $\hat\beta$ defined by (3) when $p \gg n$.

We define $\dot l_n^*(\beta) = (\dot l_{n,1}^*(\beta),\dots,\dot l_{n,p}^*(\beta))^T = \partial l_n^*(\beta)/\partial\beta$ the gradient of the Cox partial log-likelihood
$l_n^*(\beta)$ defined by (4) and $\ddot l_n^*(\beta) = \partial^2 l_n^*(\beta)/\partial\beta\,\partial\beta^T$ the Hessian matrix.

Let us now describe the result of Huang et al. (2013), on which we rely for our study, starting with
the notations. Let $O = \{j : \beta_{0j} \ne 0\}$, $O^c = \{j : \beta_{0j} = 0\}$ and $s = |O|$ the cardinality of $O$. For any
$\xi > 1$, we define the cone

$$\mathcal{C}(\xi,O) = \{b \in \mathbb{R}^p : |b_{O^c}|_1 \le \xi\,|b_O|_1\}.$$

For this cone, let us define the following condition:

$$0 < \kappa(\xi,O) = \inf_{0\ne b\in\mathcal{C}(\xi,O)}\frac{s^{1/2}\big(b^T\,\ddot l_n^*(\beta_0)\,b\big)^{1/2}}{|b_O|_1}.$$

This term corresponds to the compatibility factor introduced by van de Geer (2007). It is one of the
classical conditions used to obtain non-asymptotic oracle inequalities. See also Bühlmann and van de
Geer (2009) for more details about this compatibility factor and its comparison with
other assumptions such as the Restricted Eigenvalue condition, among others.
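To make the cone and the compatibility ratio concrete, the following minimal sketch checks membership in $\mathcal{C}(\xi,O)$ and evaluates the ratio $s^{1/2}(b^T H b)^{1/2}/|b_O|_1$ for a single vector $b$. The function names are hypothetical and the matrix `hessian` is a generic stand-in for the Hessian of the partial log-likelihood; since $\kappa(\xi,O)$ is an infimum over the cone, the value obtained for one $b$ in the cone is only an upper bound on it.

```python
import numpy as np

def in_cone(b, support, xi):
    """Check |b_{O^c}|_1 <= xi * |b_O|_1 for the index set O = support."""
    b = np.asarray(b, dtype=float)
    mask = np.zeros(b.size, dtype=bool)
    mask[list(support)] = True
    return bool(np.abs(b[~mask]).sum() <= xi * np.abs(b[mask]).sum())

def compatibility_ratio(b, support, hessian):
    """Evaluate s^{1/2} (b^T H b)^{1/2} / |b_O|_1 for one vector b with b_O != 0."""
    b = np.asarray(b, dtype=float)
    hessian = np.asarray(hessian, dtype=float)
    s = len(support)
    b_O_l1 = np.abs(b[list(support)]).sum()
    return float(np.sqrt(s * (b @ hessian @ b)) / b_O_l1)
```

Minimizing `compatibility_ratio` over vectors for which `in_cone` holds would approximate $\kappa(\xi,O)$; we do not attempt that optimization here.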

With these notations, we can state the following theorem established by Huang et al. (2013).

Theorem A.1 (Huang et al. (2013)). Let $k > 0$ and $\nu = B(\xi+1)s\,\Gamma_{n,k}/\{2\kappa^2(\xi,O)\}$. Suppose
Assumption 2.2.(i) holds and $\nu \le 1/e$. Then, on the event

$$\Omega^k_H = \Big\{|\dot l_n^*(\beta_0)|_\infty \le \frac{\xi-1}{\xi+1}\,\Gamma_{n,k}\Big\}, \quad\text{with}\quad \Gamma_{n,k} = C_0 B\,\frac{\xi+1}{\xi-1}\sqrt{\frac{2\log(pn^k)}{n}}, \qquad (43)$$

we have

$$|\hat\beta - \beta_0|_1 \le \frac{e^{\eta}(\xi+1)s}{2\kappa^2(\xi,O)}\,\Gamma_{n,k},$$

where $\eta \le 1$ is the smaller solution of $\eta e^{-\eta} = \nu$ and $C_0 \ge \sqrt{\tau\,\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{\beta_0^T Z}]}$.

We refer to Huang et al. (2013) for the proof of Theorem A.1. Huang et al. (2013) have calculated
the probability of $\Omega^k_H$ only in the case where $\max_{1\le i\le n} N_i(\tau) < +\infty$. We extend the result to the
unbounded case in the following lemma.


Lemma A.2. Let us consider, for $k > 0$, the event $\Omega^k_H$ defined by (43). Then, under Assumptions 2.2.(i)
and (iv), there exists a constant $c > 0$ depending on $\tau$, $\|\alpha_0\|_{\infty,\tau}$ and $\mathbb{E}[e^{\beta_0^T Z}]$ such that

$$\mathbb{P}((\Omega^k_H)^c) \le cn^{-k}.$$

The proof of this lemma follows. From this lemma, we can rewrite Theorem A.1 as:

Corollary A.3. Let $\nu = B(\xi+1)s\,\Gamma_{n,k}/\{2\kappa^2(\xi,O)\}$, $k > 0$ and $c > 0$. Suppose Assumptions 2.2.(i)
and (iv) hold and $\nu \le 1/e$. Then, with probability larger than $1 - cn^{-k}$,

$$|\hat\beta - \beta_0|_1 \le \frac{e^{\eta}(\xi+1)s}{2\kappa^2(\xi,O)}\,\Gamma_{n,k}, \quad\text{with}\quad \Gamma_{n,k} = C_0 B\,\frac{\xi+1}{\xi-1}\sqrt{\frac{2\log(pn^k)}{n}},$$

where $\eta \le 1$ is the smaller solution of $\eta e^{-\eta} = \nu$ and $C_0 \ge \sqrt{\tau\,\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{\beta_0^T Z}]}$.

From Corollary A.3 and Assumption 2.2.(i), we deduce a prediction inequality given by the
following proposition.

Proposition A.4. Let $k > 0$ and $c > 0$. Under Assumptions 2.2.(i) and 2.2.(iv), with probability
larger than $1 - cn^{-k}$, we have

$$|\hat\beta - \beta_0|_1 \le C(s)\sqrt{\frac{\log(pn^k)}{n}}, \qquad (44)$$

where $C(s) > 0$ is a constant depending on the sparsity index $s$.


Remark A.5. From Proposition A.4 and Definition (27) ofLl^, we deduce that Ul k H C 

Proof of Lemma A.2. To prove Lemma A.2, we start from Lemma 3.3, p. 10, in the paper of
Huang et al. (2013), which we state below.

Lemma A.6 (Lemma 3.3 from Huang et al. (2013)). Suppose that Assumption 2.2.(i) is verified. Let
$\dot l_n^*(\beta)$ be the gradient of $l_n^*(\beta)$ defined by (4). Then, for all $C_0 > 0$,

$$\mathbb{P}\Big(|\dot l_n^*(\beta_0)|_\infty > C_0 B x,\ \sum_{i=1}^{n}\int_0^\tau Y_i(t)\,dN_i(t) \le C_0^2\,n\Big) \le 2p\,e^{-nx^2/2}. \qquad (45)$$

In particular, if $\max_{i\le n} N_i(\tau) \le 1$, then $\mathbb{P}(|\dot l_n^*(\beta_0)|_\infty > Bx) \le 2p\,e^{-nx^2/2}$.

Before proving the lemma of interest, we recall the Bernstein Inequality for martingales
(see van de Geer (1995)).

Lemma A.7 (Lemma 2.1 from van de Geer (1995)). Let $\{M_t\}_{t\ge 0}$ be a locally square integrable
martingale w.r.t. the filtration $\{\mathcal{F}_t\}_{t\ge 0}$. Denote the predictable variation of $\{M_t\}$ by $V_t = \langle M, M\rangle_t$,
$t \ge 0$, and its jumps by $\Delta M_t = M_t - M_{t-}$. Suppose that $|\Delta M_t| \le K$ for all $t > 0$ and some
$0 \le K < \infty$. Then for each $a > 0$, $b > 0$,

$$\mathbb{P}\big(M_t \ge a \text{ and } V_t \le b^2 \text{ for some } t\big) \le \exp\Big(-\frac{a^2}{2(aK + b^2)}\Big).$$
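From Lemma A.6, to prove Lemma A.2, it remains to control

$$\mathbb{P}\Big(\sum_{i=1}^{n}\int_0^\tau Y_i(t)\,dN_i(t) > C_0^2\,n\Big).$$

Using the Doob-Meyer decomposition and since

$$\sum_{i=1}^{n}\int_0^\tau Y_i(t)\,\alpha_0(t)\,e^{\beta_0^T Z_i}\,Y_i(t)\,dt \le n\tau\,\|\alpha_0\|_{\infty,\tau}\,e^{B|\beta_0|_1},$$

we obtain, for $C_0 > \sqrt{\tau\,\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{\beta_0^T Z}]}$,

$$\mathbb{P}\Big(\sum_{i=1}^{n}\int_0^\tau Y_i(t)\,dN_i(t) > C_0^2\,n\Big) \le \mathbb{P}\Big(\sum_{i=1}^{n}\int_0^\tau Y_i(t)\,dM_i(t) > C_0^2\,n - n\tau\,\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{\beta_0^T Z}]\Big).$$

Then, we apply Lemma A.7 to the martingale $\sum_{i=1}^{n}\int_0^t Y_i(u)\,dM_i(u)$, with $K = 1$ and

$$V_t = \sum_{i=1}^{n}\int_0^t Y_i^2(u)\,\alpha_0(u)\,e^{\beta_0^T Z_i}\,Y_i(u)\,du \le \|\alpha_0\|_{\infty,\tau}\,\tau\,\mathbb{E}[e^{\beta_0^T Z}]\,n.$$

We obtain

$$\mathbb{P}\Big(\sum_{i=1}^{n}\int_0^\tau Y_i(t)\,dM_i(t) > C_0^2\,n - n\tau\,\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{\beta_0^T Z}]\Big) \le \exp\Big(-\frac{n\big(C_0^2 - \tau\,\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{\beta_0^T Z}]\big)^2}{2C_0^2}\Big).$$

Finally, we get

$$\mathbb{P}\big(|\dot l_n^*(\beta_0)|_\infty > C_0 B x\big) \le 2p\,e^{-nx^2/2} + \exp\Big(-\frac{n}{2C_0^2}\big(C_0^2 - \tau\,\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{\beta_0^T Z}]\big)^2\Big).$$

Taking $x = \sqrt{2\log(n^k p)/n}$, there exists a constant $c > 0$ depending on $\tau$, $\|\alpha_0\|_{\infty,\tau}$ and $\mathbb{E}[e^{\beta_0^T Z}]$
such that

$$\mathbb{P}((\Omega^k_H)^c) \le cn^{-k},$$

which leads to the expected result of Lemma A.2.

□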
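The two computations used at the end of this proof can be checked with a short numerical sketch: the simplification of the Bernstein exponent $a^2/(2(aK+b^2))$ when $a = n(C_0^2 - c)$, $b^2 = nc$ and $K = 1$, and the fact that $x = \sqrt{2\log(n^k p)/n}$ turns $2pe^{-nx^2/2}$ into $2n^{-k}$. The function names and the scalar `c`, which stands for $\tau\|\alpha_0\|_{\infty,\tau}\mathbb{E}[e^{\beta_0^T Z}]$ and is assumed smaller than $C_0^2$, are hypothetical.

```python
import math

def bernstein_exponent(a, b_sq, K):
    """Exponent a^2 / (2 (a K + b^2)) of the martingale Bernstein inequality."""
    return a**2 / (2 * (a * K + b_sq))

def lemma_a2_terms(n, p, k, C0_sq, c):
    """Return (Bernstein exponent, its simplified form, gradient term 2 p exp(-n x^2 / 2))."""
    a = n * (C0_sq - c)          # deviation level for the martingale
    b_sq = n * c                 # bound on the predictable variation
    exponent = bernstein_exponent(a, b_sq, K=1.0)
    simplified = n * (C0_sq - c) ** 2 / (2 * C0_sq)
    x = math.sqrt(2 * math.log(n**k * p) / n)
    gradient_term = 2 * p * math.exp(-n * x**2 / 2)
    return exponent, simplified, gradient_term
```

The first two returned values coincide, and the third equals $2n^{-k}$, so the gradient term is of the announced polynomial order while the martingale term decays exponentially in $n$.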